Tidymodels

data science

Tidymodels is a collection of R packages for modeling and machine learning. It is a separate meta-package that follows tidyverse design principles.

Load packages:

library(tidymodels)

Rsample (sampling & splitting)

Training and test split

data_split <- initial_split(dataset, prop = 0.75, strata = target_col)
train_set <- data_split %>% training()
test_set <- data_split %>% testing()

The strata argument produces a stratified split, so the training and test sets contain approximately the same fraction of each target class.
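As a minimal sketch of the stratified split described above, the following uses the built-in iris data (an assumption made for illustration, with Species standing in for target_col) and compares the class fractions in both sets.

```r
library(tidymodels)

set.seed(123)  # make the random split reproducible

# iris and Species are stand-ins for dataset and target_col
iris_split <- initial_split(iris, prop = 0.75, strata = Species)
iris_train <- training(iris_split)
iris_test  <- testing(iris_split)

# Class fractions should be close to 1/3 in both sets
iris_train %>% count(Species) %>% mutate(frac = n / sum(n))
iris_test %>% count(Species) %>% mutate(frac = n / sum(n))
```

Setting a seed before the split keeps the result reproducible between runs.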

Recipes (feature engineering)

Parsnip (model fitting)

General model formula

outcome_variable ~ predictor_1 + predictor_2 + ... 
outcome_variable ~ . # to use all available predictors
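To see what the dot in a formula expands to, base R's terms() can list the predictors for a given data frame. This is only a sketch on the built-in mtcars data (chosen here for illustration), with mpg as the outcome.

```r
# Explicit predictors: only wt and hp are used
predictors_two <- attr(terms(mpg ~ wt + hp), "term.labels")
predictors_two

# Dot notation: every column of mtcars except the outcome mpg
predictors_all <- attr(terms(mpg ~ ., data = mtcars), "term.labels")
predictors_all
```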

Create model object

model <- linear_reg() %>%
    set_engine('lm') %>%
    set_mode('regression')

Fit the model

model_fit <- model %>%
    fit(target_col ~ ., data = train_set)

Get model summary

tidy(model_fit)

Returns a tibble with parameter estimates, standard errors, test statistics and p-values.
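The create, fit and summarise steps above can be sketched end to end as follows. The mtcars data and the predictors wt and hp are assumptions for illustration, not part of the original notes.

```r
library(tidymodels)

# Model specification: ordinary least squares via the lm engine
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Fit on the built-in mtcars data (stand-in for train_set)
lm_fit <- lm_spec %>%
  fit(mpg ~ wt + hp, data = mtcars)

# One row per model term: intercept, wt, hp
coef_table <- tidy(lm_fit)
coef_table
```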

Predict on new values

predictions <- model_fit %>%
    predict(new_data = test_set)

Tune & Dials (hyperparameter optimization)

Yardstick (performance evaluation)

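Requires a tibble/dataset containing both the true and the predicted outcomes. The output of predict() only holds the predictions, so the true values have to be added first.

predictions %>%
    bind_cols(test_set) %>% # put the true outcomes next to the predictions
    rmse(truth = target_col, estimate = .pred) # .pred is the default name of the prediction col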

Regression quality metrics
R squared	`prediction_set %>% rsq(truth = ..., estimate = ...)`
Root mean squared error	`... rmse()`
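A minimal sketch of computing both regression metrics at once with metric_set(), assuming the built-in mtcars data with mpg as the outcome (names chosen for illustration only):

```r
library(tidymodels)

set.seed(42)
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

car_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ wt + hp, data = car_train)

# Predictions plus the true outcomes in one tibble
car_results <- car_fit %>%
  predict(new_data = car_test) %>%
  bind_cols(car_test)

# Bundle several metrics into one function
reg_metrics <- metric_set(rmse, rsq)
car_metrics <- car_results %>%
  reg_metrics(truth = mpg, estimate = .pred)
car_metrics
```

The result is a tibble with one row per metric, which makes it easy to compare several models side by side.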

Classification quality metrics
accuracy	`prediction_set %>% accuracy(truth = ..., estimate = ...)`
balanced accuracy	`... bal_accuracy()`
precision	`... precision()`
recall	`... recall()`
sensitivity	`... sensitivity()`
specificity	`... specificity()`
area under the ROC curve	`... roc_auc()` (takes the class probability column instead of `estimate`)

More metrics are listed in the yardstick documentation.
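The classification metrics can be tried out on two_class_example, a small example data set that ships with yardstick (columns truth, Class1, Class2 and predicted). This is only a sketch; in practice the tibble would come from a fitted classifier.

```r
library(tidymodels)

data("two_class_example", package = "yardstick")

# Metrics based on hard class predictions
class_metrics <- metric_set(accuracy, precision, recall)
hard_class_results <- two_class_example %>%
  class_metrics(truth = truth, estimate = predicted)
hard_class_results

# roc_auc() needs the predicted probability of the first factor level
auc_result <- two_class_example %>%
  roc_auc(truth = truth, Class1)
auc_result
```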

Streamlined approach

# Fit on the training set and evaluate on the test set in one step:
model_last_fit <- model %>%
    last_fit(target_col ~ .,
        split = data_split)
# Return standard quality metrics:
model_last_fit %>%
    collect_metrics()
# Return tibble with predictions and true target values:
model_last_fit %>%
    collect_predictions()
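A self-contained sketch of the streamlined approach, again assuming the built-in mtcars data with mpg as the outcome (illustrative names only):

```r
library(tidymodels)

set.seed(1)
car_split <- initial_split(mtcars, prop = 0.75)

lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Fits on the training part and predicts on the test part of the split
final_fit <- lm_spec %>%
  last_fit(mpg ~ wt + hp, split = car_split)

final_metrics <- collect_metrics(final_fit)      # rmse and rsq for regression
final_preds   <- collect_predictions(final_fit)  # .pred next to the true mpg
final_metrics
```

Because last_fit() takes the split object directly, there is no need to call training() and testing() by hand.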