wrangling

Andrew W. Park & John M. Drake

Reading in data (files and objects)

Files: You've already seen the main command for reading in data from files: read.csv

We can also use read_csv, from the readr package, in much the same way. It's slightly faster and the resulting data frame is recognized as a tibble (more info: https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html)
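To see the difference between the two readers, here is a minimal sketch. It assumes the readr package is installed; the small CSV file and its contents are made up for illustration and written to a temporary file.

```r
library(readr)  # provides read_csv

# Write a small, made-up CSV to a temporary file so the example is self-contained
path <- tempfile(fileext = ".csv")
writeLines(c("state,sauce", "GA,sweetBBQ", "CA,hot"), path)

base_df <- read.csv(path)  # base R: returns a plain data.frame
tidy_df <- read_csv(path)  # readr: returns a tibble

class(base_df)  # inspect the class of each result
class(tidy_df)
```

Both objects behave as data frames; the tibble mainly differs in how it prints and in being stricter about subsetting.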


Objects: We can read previously saved objects (incl. data frames) with the load command

For example, if you created a data frame called df that took a lot of time (e.g. from some slow simulation) then you could first save it:

  • save(df,file='get_df.Rda')

...and later, load it again (e.g. when you start a new session and it's no longer in memory)

  • load('get_df.Rda')

The file extension .Rda is short for .RData (and you can use .RData as the extension, if you prefer)
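The save/load round trip can be sketched as follows. The data frame here is a small made-up stand-in for a slow simulation result, and the file is written to a temporary directory (an assumption for illustration) rather than the working directory.

```r
# Stand-in for a data frame that was expensive to compute
df <- data.frame(run = 1:3, value = c(0.2, 0.5, 0.9))

path <- file.path(tempdir(), "get_df.Rda")
save(df, file = path)

rm(df)                  # simulate a new session in which df is no longer in memory
restored <- load(path)  # load() returns the names of the objects it restored

restored      # name(s) of the restored object(s)
exists("df")  # df is back under its original name
```

Note that load restores the object under the name it was saved with, so it will overwrite any existing object called df.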

The 'tidy data' format

It is good practice to work with 'tidy data':

  • Each variable has its own column
  • Each observation has its own row
  • Each value has its own cell

Getting data into this format typically involves a few commands from the tidyr and dplyr packages, particularly:

  • gather (tidyr)
  • select (dplyr)
  • mutate (dplyr)
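As an illustration of how these three commands fit together, here is a minimal sketch. The cases_wide table and its column names are made up for this example. Note that in recent versions of tidyr, gather still works but is superseded by pivot_longer.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Made-up 'wide' data: one column per year, so the variable 'year' is hidden in column names
cases_wide <- tibble(
  state  = c("GA", "CA"),
  source = c("survey", "survey"),
  `2015` = c(10, 20),
  `2016` = c(15, 25)
)

cases_tidy <- cases_wide %>%
  gather(key = "year", value = "cases", `2015`, `2016`) %>%  # one row per state-year
  select(state, year, cases) %>%                              # keep only the columns needed
  mutate(year = as.integer(year))                             # store year as a number, not text

cases_tidy
```

After these steps each variable (state, year, cases) has its own column and each state-year observation has its own row.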

Combining data sets

library(dplyr)  # provides tibble(), the join functions and %>%

gourmet <- tibble(
  state=c("GA","CA","NC","TX"),
  sauce=c("sweetBBQ","hot","tangyBBQ","smokyBBQ")
)

greeting <- tibble(
  state=c("TX","GA","CA","NY"),
  word=c("howdy","hey y'all","'sup","hello")
)

# Keep only the states that appear in both tables (NC and NY are dropped)
inner_join(gourmet, greeting, by="state")
# A tibble: 3 × 3
  state    sauce      word
  <chr>    <chr>     <chr>
1    GA sweetBBQ hey y'all
2    CA      hot      'sup
3    TX smokyBBQ     howdy
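For contrast with inner_join, here is a short sketch of two other dplyr joins applied to the same two tables. The choice of left_join and full_join is ours, added for illustration; the tables are redefined so the block runs on its own.

```r
library(dplyr)

gourmet <- tibble(
  state = c("GA", "CA", "NC", "TX"),
  sauce = c("sweetBBQ", "hot", "tangyBBQ", "smokyBBQ")
)

greeting <- tibble(
  state = c("TX", "GA", "CA", "NY"),
  word  = c("howdy", "hey y'all", "'sup", "hello")
)

# Keep every row of gourmet; NC has no match in greeting, so its word is NA
left <- left_join(gourmet, greeting, by = "state")

# Keep every state from either table; unmatched cells are filled with NA
full <- full_join(gourmet, greeting, by = "state")

left
full
```

Which join to use depends on whether unmatched rows should be dropped (inner_join) or kept with missing values (left_join, full_join).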

Obtaining summary information

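# Simulate grades for 10 students; the printed values may differ across R versions,
# because the default behavior of sample() changed in R 3.6.0
set.seed(123)
grades <- tibble(
  student_ID=as.character(1:10),
  major=sample(c("E","B"),10,replace=TRUE),
  U_G=sample(c("U","G"),10,replace=TRUE),
  score=sample(60:100,10,replace=TRUE)
) %>% print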
# A tibble: 10 × 4
   student_ID major   U_G score
        <chr> <chr> <chr> <int>
1           1     E     G    96
2           2     B     U    88
3           3     E     G    86
4           4     B     G   100
5           5     B     U    86
6           6     E     G    89
7           7     B     U    82
8           8     B     U    84
9           9     B     U    71
10         10     E     G    66


grades %>% group_by(major) %>% summarize(avScore=mean(score))
# A tibble: 2 × 2
  major  avScore
  <chr>    <dbl>
1     B 85.16667
2     E 84.25000
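The same pattern extends to more than one grouping variable and more than one summary. The sketch below is an illustration rather than part of the original analysis: it hard-codes the scores from the printed grades table so the result does not depend on the random number generator, and the names by_group and n are our own.

```r
library(dplyr)

# Values copied from the printed grades table, so no random sampling is needed
grades <- tibble(
  student_ID = as.character(1:10),
  major = c("E", "B", "E", "B", "B", "E", "B", "B", "B", "E"),
  U_G   = c("G", "U", "G", "G", "U", "G", "U", "U", "U", "G"),
  score = c(96L, 88L, 86L, 100L, 86L, 89L, 82L, 84L, 71L, 66L)
)

by_group <- grades %>%
  group_by(major, U_G) %>%                        # one group per major / U_G combination
  summarize(avScore = mean(score), n = n()) %>%   # mean score and group size
  ungroup()

by_group
```

Reporting the group size alongside the mean makes it easier to spot averages that rest on very few students.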