dplyr

dplyr is a data transformation package for R and part of the tidyverse collection of packages.

Data Wrangling with dplyr

Load dplyr

library(dplyr)

dplyr code is usually written with the pipe operator (%>%), which passes the result of one transformation as the first argument to the next:

dataset %>%
    filter(column1 == "Value") %>%
    arrange(column2)
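
A minimal runnable sketch of such a pipeline, assuming a small made-up data frame `cities` (its name, columns, and values are purely illustrative):

```r
library(dplyr)

# Hypothetical example data; names and values are made up for illustration.
cities <- data.frame(
  name       = c("A", "B", "C", "D"),
  country    = c("X", "Y", "X", "X"),
  population = c(300, 150, 100, 200)
)

result <- cities %>%
  filter(country == "X") %>%  # keep only rows where country is "X"
  arrange(population)         # sort the remaining rows in ascending order

result$name
```

Each step receives the data frame produced by the previous step, so the order of the steps matters.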

Filter

filter(column1 == "Value")

Order / Sort

arrange(col1)        # ascending
arrange(desc(col2))  # descending

Mutate / Change / Add columns

mutate(resultCol = col / 1000)
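
As a small sketch, assuming a hypothetical data frame `trips` with a `metres` column (both names are illustrative), mutate() can add several columns at once, and later expressions may use columns created earlier in the same call. Assigning to an existing column name overwrites that column instead of adding a new one.

```r
library(dplyr)

# Hypothetical data: trip distances in metres.
trips <- data.frame(id = 1:3, metres = c(1500, 250, 12000))

trips_km <- trips %>%
  mutate(
    km        = metres / 1000,  # add a new column
    long_trip = km > 5          # uses the km column created just above
  )
```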

Aggregation

Summarize / aggregate

dataset %>%
    summarize(medianCol1 = median(col1))

Aggregation functions

sum	`sum()`
mean	`mean()`
median	`median()`
minimum / maximum	`min()` / `max()`
first / last value	`first()` / `last()`
number of rows / number of distinct values	`n()` / `n_distinct()`
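
The functions above can be combined in a single summarize() call. The following sketch assumes a made-up data frame `scores` (names and values are illustrative only):

```r
library(dplyr)

# Hypothetical example data.
scores <- data.frame(
  player = c("a", "b", "a", "c", "b"),
  points = c(10, 20, 30, 40, 50)
)

stats <- scores %>%
  summarize(
    total    = sum(points),
    average  = mean(points),
    middle   = median(points),
    lowest   = min(points),
    highest  = max(points),
    first_pt = first(points),      # value in the first row
    last_pt  = last(points),       # value in the last row
    rows     = n(),                # number of rows
    players  = n_distinct(player)  # number of distinct players
  )
```

Without a preceding group_by(), the result is a single row with one column per summary.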

Group-by

dataset %>%
    group_by(year_col, continent_col) %>%
    summarize(mean_gdp = mean(gdp_col))

After group_by(), summarize() computes the aggregate separately for each group and returns one row per group.
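
A minimal sketch of grouped aggregation, assuming a made-up data frame `economy` (names and values are illustrative). The `.groups = "drop"` argument is an addition not used in the snippet above; it returns an ungrouped result.

```r
library(dplyr)

# Hypothetical example data.
economy <- data.frame(
  year      = c(2000, 2000, 2000, 2010, 2010),
  continent = c("Asia", "Asia", "Europe", "Asia", "Europe"),
  gdp       = c(10, 30, 50, 40, 60)
)

by_group <- economy %>%
  group_by(year, continent) %>%                      # one group per year/continent pair
  summarize(mean_gdp = mean(gdp), .groups = "drop")  # one output row per group
```

The two Asia rows for 2000 are collapsed into one row with their mean; every other group contains a single row here.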

Combine tables

Stack horizontally (adds new columns; rows are matched by position)

dataset %>%
    bind_cols(new_dataset)
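
A small sketch with two made-up data frames (`base_tbl` and `extra_tbl` are hypothetical names). Both inputs need the same number of rows, because bind_cols() pairs rows purely by their position and not by any key.

```r
library(dplyr)

# Hypothetical example data with the same number of rows.
base_tbl  <- data.frame(id = 1:3, x = c("a", "b", "c"))
extra_tbl <- data.frame(y = c(TRUE, FALSE, TRUE))

combined <- base_tbl %>%
  bind_cols(extra_tbl)  # appends the columns of extra_tbl to the right

names(combined)
```

If the rows of the two tables are not guaranteed to be in the same order, a join on a key column is the safer choice.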

Join tables

dataset_1 %>% left_join(dataset_2, by = join_by(col1 == col2), relationship = "one-to-one") # keeps all rows of dataset_1
... %>% right_join(...) # keeps all rows of the right-hand dataset
... %>% inner_join(...) # keeps only rows with a match in both datasets
... %>% full_join(...) # keeps all rows from both datasets

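The relationship argument states the expected relationship between the keys of the two tables: "one-to-one", "one-to-many", "many-to-one", or "many-to-many". The join raises an error if the stated relationship is violated; "many-to-many" performs no check.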
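
The join types can be compared on two small made-up tables (`orders` and `customers` are hypothetical names and values). This sketch assumes dplyr 1.1.0 or later, where join_by() and the relationship argument are available.

```r
library(dplyr)

# Hypothetical example data: customer 30 has no match, customer 40 has no order.
orders    <- data.frame(order_id = 1:3, customer = c(10, 20, 30))
customers <- data.frame(customer_id = c(10, 20, 40),
                        name = c("Ann", "Bob", "Dee"))

# Keeps all orders; unmatched orders get NA in the customer columns.
left_res <- orders %>%
  left_join(customers, by = join_by(customer == customer_id),
            relationship = "many-to-one")

# Keeps only orders that have a matching customer.
inner_res <- orders %>%
  inner_join(customers, by = join_by(customer == customer_id))

# Keeps every row from both tables.
full_res <- orders %>%
  full_join(customers, by = join_by(customer == customer_id))
```

Here "many-to-one" states that several orders may point to the same customer, but each order matches at most one customer.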