Dplyr
language
data science
dplyr is a data transformation package in the tidyverse, a collection of R packages for data science.
Data Wrangling with dplyr
- Load dplyr
library(dplyr)
The dplyr package uses the pipe operator %>% to pass the result of one transformation on to the next:
dataset %>%
  filter(column1 == "Value") %>%
  arrange(column2)
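To make the pipe concrete, here is a minimal sketch that runs the same two steps once as nested calls and once as a pipeline. The tibble `dataset` and its values are made up for illustration; only the column names follow the snippet above.

```r
library(dplyr)

# Hypothetical data, for illustration only
dataset <- tibble(
  column1 = c("Value", "Other", "Value"),
  column2 = c(3, 1, 2)
)

# Nested calls: read from the inside out
nested <- arrange(filter(dataset, column1 == "Value"), column2)

# Pipeline: read top to bottom; each result becomes the
# first argument of the next function
piped <- dataset %>%
  filter(column1 == "Value") %>%
  arrange(column2)

# Both forms should produce the same table
all.equal(nested, piped)
```

The pipeline form tends to be easier to read once more than two steps are chained.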
- Filter
filter(column1 == "Value")
- Order / Sort
arrange(col1) %>% # ascending
arrange(desc(col2)) # descending
- Mutate / Change / Add columns
mutate(resultCol = col / 1000)
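The three verbs above are usually chained. The following sketch combines them on a small made-up table; the data values are assumptions, and the column names simply reuse the placeholders from the list.

```r
library(dplyr)

# Hypothetical data; column names are placeholders
dataset <- tibble(
  column1 = c("Value", "Value", "Other", "Value"),
  col     = c(2500, 1000, 4000, 1750)
)

result <- dataset %>%
  filter(column1 == "Value") %>%      # keep rows where column1 equals "Value"
  mutate(resultCol = col / 1000) %>%  # add a derived column
  arrange(desc(resultCol))            # sort by the new column, largest first

result
```

Note that `mutate()` can refer to existing columns by name, and later steps in the pipeline can already use the column it created.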
Aggregation
- Summarize / aggregate
dataset %>% summarize(medianCol1 = median(col1))
Aggregation | Function |
---|---|
sum | sum() |
mean | mean() |
median | median() |
minimum / maximum | min() / max() |
first / last value | first() / last() |
row count / distinct count | n() / n_distinct() |
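As a quick illustration, the functions from the table can all be used inside a single `summarize()` call. The values in `dataset` and the result column names are invented for this sketch.

```r
library(dplyr)

# Made-up values for illustration
dataset <- tibble(col1 = c(4, 8, 8, 10))

stats <- dataset %>%
  summarize(
    total     = sum(col1),
    average   = mean(col1),
    middle    = median(col1),
    smallest  = min(col1),
    largest   = max(col1),
    first_val = first(col1),
    last_val  = last(col1),
    rows      = n(),              # number of rows
    distinct  = n_distinct(col1)  # number of unique values
  )

stats
```

Without a preceding `group_by()`, the result is a single row with one column per aggregate.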
Group-by
dataset %>%
  group_by(year_col, continent_col) %>%
  summarize(mean(gdp_col))
summarize() then computes the aggregate separately for each group defined by group_by().
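A small worked example may help here. The table below is hypothetical, as are its column names and values; `.groups = "drop"` is an optional argument that returns an ungrouped result.

```r
library(dplyr)

# Hypothetical data with two grouping columns and one value column
dataset <- tibble(
  year_col      = c(2000, 2000, 2000, 2010, 2010),
  continent_col = c("Asia", "Asia", "Europe", "Asia", "Europe"),
  gdp_col       = c(10, 20, 40, 30, 50)
)

by_group <- dataset %>%
  group_by(year_col, continent_col) %>%
  summarize(mean_gdp = mean(gdp_col), .groups = "drop")

by_group
```

The result has one row per combination of year and continent that occurs in the data (four rows here), instead of a single overall mean.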
Combine tables
Stack horizontally (new column)
dataset %>%
  bind_cols(new_dataset)
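A minimal sketch of horizontal stacking, using two invented tables. `bind_cols()` matches rows by position rather than by a key, so this assumes both tables have the same number of rows in the same order.

```r
library(dplyr)

# Hypothetical tables with the same number of rows
dataset     <- tibble(id = 1:3, col1 = c("a", "b", "c"))
new_dataset <- tibble(col2 = c(10, 20, 30))

combined <- dataset %>%
  bind_cols(new_dataset)

combined
```

If the rows of the two tables are not guaranteed to line up, a join on a key column is usually the safer choice.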
- Join tables
dataset_1 %>% left_join(dataset_2, by = join_by(col1 == col2), relationship = "one-to-one") # keeps all rows of dataset_1
... %>% right_join(...) # keeps all rows of the right-hand table
... %>% inner_join(...) # keeps only rows with a match in both tables
... %>% full_join(...) # keeps all rows from both tables
The relationship argument checks the expected match type between the two tables. It can be "one-to-one", "one-to-many", "many-to-one", or "many-to-many" (which performs no check).
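To compare the four join types side by side, the sketch below joins two small invented tables whose keys only partly overlap. It assumes dplyr 1.1.0 or later, where `join_by()` and the `relationship` argument are available; the table contents and value column names are hypothetical.

```r
library(dplyr)

# Hypothetical tables: keys 2 and 3 appear in both, 1 and 4 in only one
dataset_1 <- tibble(col1 = c(1, 2, 3), left_val  = c("a", "b", "c"))
dataset_2 <- tibble(col2 = c(2, 3, 4), right_val = c("x", "y", "z"))

# Keys are unique on both sides, so the one-to-one check passes
left <- dataset_1 %>%
  left_join(dataset_2, by = join_by(col1 == col2), relationship = "one-to-one")

right <- dataset_1 %>% right_join(dataset_2, by = join_by(col1 == col2))
inner <- dataset_1 %>% inner_join(dataset_2, by = join_by(col1 == col2))
full  <- dataset_1 %>% full_join(dataset_2,  by = join_by(col1 == col2))

# Row counts per join type: left 3, right 3, inner 2, full 4
c(left = nrow(left), right = nrow(right), inner = nrow(inner), full = nrow(full))
```

Rows without a partner in the other table get `NA` in the columns that come from that table.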