class: center, middle, inverse, title-slide

.title[
# Quantitative Finance and Financial Programming
]
.subtitle[
## L13 The `tidymodels` Suite
]
.author[
### 曾永艺 (Zeng Yongyi)
]
.institute[
### School of Management, Xiamen University
]
.date[
### 2022-12-16
]

---
class: center, middle
<img src="imgs/tidyverse.png" width="85%" style="display: block; margin: auto;" />

---

### >> Why do statistical modeling in R?

--

<br>

.pull-left.font150.bold[
💪 R and most R packages are developed by statisticians and data-analysis experts

💪 R has an exceptionally rich and up-to-date collection of [.red[{{statistical modeling and machine learning packages}}]](https://cran.r-project.org/web/views/)

💪 R interfaces easily with other software and languages (e.g. C++, Python, Spark, TensorFlow)
]

--

.pull-right[
<img src="imgs/MLwR.png" width="80%" style="display: block; margin: auto;" />
]

---

### >> What challenges does statistical modeling in R face?
--

<br>

.pull-left.font150.bold[
😰 As an interpreted data-analysis language, R runs slower than compiled languages such as C, C++, and Java

😰 The size of the data R can work on is often limited by the computer's memory

.red[😰 The interfaces of the various R packages are far from uniform]
]

--

.pull-right.code80[
|Function     |Package      |Code                                         |
|:------------|:------------|:--------------------------------------------|
|`lda`        |`MASS`       |`predict(obj)`                               |
|`glm`        |`stats`      |`predict(obj, type = "response")`            |
|`gbm`        |`gbm`        |`predict(obj, type = "response", n.trees)`   |
|`rpart`      |`rpart`      |`predict(obj, type = "prob")`                |
|`Weka`       |`RWeka`      |`predict(obj, type = "probability")`         |
]

--

<br>

.right.bold.font120[
[`caret`](https://github.com/topepo/caret/) -> .font120[👉 .red[[`tidymodels`!](https://www.tidymodels.org/)]]<br>
[`mlr`](https://mlr.mlr-org.com/) -> [`mlr3`!](https://mlr3.mlr-org.com/)
]

---
layout: false

<img src="imgs/ds.png" width="40%" style="display: block; margin: auto auto auto 0;" />
<hr>
<img src="imgs/tidymodels-flow.svg" width="100%" style="display: block; margin: auto;" />

---
class: inverse, center, middle, hide_logo
background-image: url(imgs/logo-rsample.svg)
background-size: 4em
background-position: 12% 45%

# 1. Resampling Data -> `rsample` .font60[<sup>v1.1.1</sup>] .font150[(_General Resampling Infrastructure_)]

---
layout: true

### >> Resampling data -> the `rsample` package .font80[<sup>v1.1.1</sup>]

---

.code90[
```r
# First install and load tidymodels — one package that loads the whole suite
library(tidymodels)  # v1.0.0
```

```
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
```

```
#> ✔ broom        1.0.1     ✔ recipes      1.0.3
#> ✔ dials        1.1.0     ✔ rsample      1.1.1
#> ✔ dplyr        1.0.10    ✔ tibble       3.1.8
#> ✔ ggplot2      3.4.0     ✔ tidyr        1.2.1
#> ✔ infer        1.0.4     ✔ tune         1.0.1
#> ✔ modeldata    1.0.1     ✔ workflows    1.1.2
#> ✔ parsnip      1.0.3     ✔ workflowsets 1.0.0
#> ✔ purrr        0.3.5     ✔ yardstick    1.1.0
```

```
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Use suppressPackageStartupMessages() to eliminate package startup messages
```
]

---

.pull-left.code95[
```r
# Set the random seed
set.seed(1234)
# Randomly split into training and test sets
iris_split <- initial_split(iris)
# initial_split(data, prop = 3/4,
#               strata = NULL,
#               breaks = 4,
#               pool = 0.1, ...)
```

```r
# Inspect the split
iris_split
```

```
#> <Training/Testing/Total>
#> <112/38/150>
```

```r
# Check the class of iris_split
iris_split %>% class()
```

```
#> [1] "initial_split" "mc_split"     
#> [3] "rsplit"
```
]

--

.pull-right.code85[
```r
# Inspect the underlying structure of the rsplit object
iris_split %>% str(vec.len = 1.5)
```

```
*#> List of 4
*#>  $ data  :'data.frame': 150 obs. of 5 variables:
#>   ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 ...
#>   ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 ...
#>   ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 ...
#>   ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 ...
#>   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 ...
*#>  $ in_id : int [1:112] 28 80 101 111 ...
#>  $ out_id: logi NA
#>  $ id    : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
#>   ..$ id: chr "Resample1"
#>  - attr(*, "class")= chr [1:3] "initial_split" "mc_split" ...
```

```r
# Extract the training set
# (Q: which function extracts the test set?)
iris_split %>% 
  training() %>% 
  glimpse(width = 40)
```

```
#> Rows: 112
#> Columns: 5
#> $ Sepal.Length <dbl> 5.2, 5.7, 6.3, 6.…
#> $ Sepal.Width  <dbl> 3.5, 2.6, 3.3, 3.…
#> $ Petal.Length <dbl> 1.5, 3.5, 6.0, 5.…
#> $ Petal.Width  <dbl> 0.2, 1.0, 2.5, 2.…
#> $ Species      <fct> setosa, versicolo…
```
]

---
layout: false
class: inverse, center, middle, hide_logo
background-image: url(imgs/logo-recipes.svg)
background-size: 4em
background-position: 12% 45%

# 2. Preprocessing Data -> `recipes` .font60[<sup>v1.0.3</sup>] .font120[(_Preprocessing and Feature Engineering Steps for Modeling_)]

---

### >> 2.1 Define the recipe

--

.pull-left.code85[
```r
iris_recipe <- 
  recipe(Species ~ ., data = iris) %>% 
  step_corr(all_predictors()) %>% 
  step_center(all_predictors()) %>% 
  step_scale(all_predictors())
iris_recipe  # a complex recipe object — View() it!
```

```
#> Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor          4
#> 
#> Operations:
#> 
#> Correlation filter on all_predictors()
#> Centering for all_predictors()
#> Scaling for all_predictors()
```
]

--

.pull-right.font100[
.bold.info[The `recipes` package provides 97 `step_*()` preprocessing functions in total, for example:]
.code80[
```
[1] "step_arrange"    "step_bagimpute" 
[3] "step_bin2factor" "step_BoxCox"    
[5] "..."             "..."            
[7] "step_unorder"    "step_window"    
[9] "step_YeoJohnson" "step_zv"
```
]

.bold.info[You can specify the variables a `step_*()` acts on in five ways (and combinations thereof):]

- enumerating the relevant variables by name
- `dplyr`-style selectors, such as `starts_with()`
- by role, via `all_outcomes()`, `all_predictors()`, or `has_role()`
- by type, via `all_numeric()`, `all_nominal()`, or `has_type()`
- by combined type + role, e.g. `all_numeric_predictors()`
]

---

### >> 2.2 Prep the recipe

--

.pull-left.code90[
```r
iris_recipe <- iris_recipe %>% 
  prep(training(iris_split),
       verbose = TRUE)
```

```
#> oper 1 step corr [training] 
#> oper 2 step center [training] 
#> oper 3 step scale [training] 
#> The retained training set is ~ 0 Mb in memory.
```

```r
# prep(x,
#      training = NULL,
#      fresh = FALSE,
#      verbose = FALSE,
#      retain = TRUE,
#      log_change = FALSE,
#      strings_as_factors = TRUE,
#      ...)
```
]

--

.pull-right.code90[
```r
iris_recipe
```

```
#> Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor          4
#> 
#> Training data contained 112 data points and no missing data.
#> 
#> Operations:
#> 
*#> Correlation filter on Petal.Length [trained]
*#> Centering for Sepal.Length, Sepal.Width, Petal.Width [trained]
*#> Scaling for Sepal.Length, Sepal.Width, Petal.Width [trained]
```
]

---

### >> 2.3 Bake the recipe

--

.pull-left.code90[
```r
# Training set:
# the data retained at the prep
# stage can be extracted directly
iris_training <- iris_recipe %>% 
*  bake(new_data = NULL)  # juice()
# bake(object,
#      new_data,
#      ...,
#      composition = "tibble")
iris_training
```

```
#> # A tibble: 112 × 4
#>   Sepal.Length Sepal.Width Petal.Width Species   
#>          <dbl>       <dbl>       <dbl> <fct>     
#> 1       -0.805       1.09       -1.36  setosa    
#> 2       -0.212      -1.07       -0.316 versicolor
#> 3        0.500       0.607       1.64  virginica 
#> # … with 109 more rows
```
]

--

.pull-right.code85[
```r
# Test set:
# preprocess it with bake()
iris_testing <- iris_recipe %>% 
*  bake(new_data = testing(iris_split))
iris_testing
```

```
#> # A tibble: 38 × 4
#>   Sepal.Length Sepal.Width Petal.Width Species
#>          <dbl>       <dbl>       <dbl> <fct>  
#> 1       -0.924       1.09        -1.36 setosa 
#> 2       -1.52        0.847       -1.23 setosa 
#> 3       -1.28        0.847       -1.36 setosa 
#> # … with 35 more rows
```
]

---
layout: false
class: inverse, center, middle, hide_logo
background-image: url(imgs/logo-parsnip.svg)
background-size: 4em
background-position: 12% 45%

# 3. Training Models -> `parsnip` .font60[<sup>v1.0.3</sup>] .font130[(_A Common API to Modeling and Analysis Functions_)]

---
layout: true

### >> 3.1 Specify the model

---

--

.pull-left.code90[
```r
# a unified API for random forest models
iris_rf <- rand_forest(
  mode = "classification",
  trees = 100
)
# rand_forest(
#   mode = "unknown",
#   engine = "ranger",
#   mtry = NULL, trees = NULL,
#   min_n = NULL)

# set the computational engine for the random forest
# show_engines("rand_forest")
(iris_rf <- iris_rf %>% 
   set_engine("ranger"))
```

```
#> Random Forest Model Specification (classification)
#> 
#> Main Arguments:
#>   trees = 100
#> 
#> Computational engine: ranger
```
]

--

.pull-right.code85[
```r
# check the class of iris_rf
iris_rf %>% class()
```

```
#> [1] "rand_forest" "model_spec"
```

```r
# inspect the underlying structure of iris_rf
iris_rf %>% str()
```

```
#> List of 7
#>  $ args                 :List of 3
#>   ..$ mtry : language ~NULL
#>   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
#>   ..$ trees: language ~100
#>   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
#>   ..$ min_n: language ~NULL
#>   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
#>  $ eng_args             : Named list()
#>   ..- attr(*, "class")= chr [1:2] "quosures" "list"
#>  $ mode                 : chr "classification"
#>  $ user_specified_mode  : logi TRUE
#>  $ method               : NULL
#>  $ engine               : chr "ranger"
#>  $ user_specified_engine: logi TRUE
#>  - attr(*, "class")= chr [1:2] "rand_forest" "model_spec"
```
]

---
layout: true

### >> 3.2 Fit the model

---

.pull-left.code85[
```r
# fit the random forest model
(iris_fit <- iris_rf %>% 
   fit(  # parsnip::fit.model_spec()
     formula = Species ~ .,
     data = iris_training
   ))
```

```
#> parsnip model object
#> 
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~100, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) 
#> 
#> Type:                             Probability estimation 
#> Number of trees:                  100 
#> Sample size:                      112 
#> Number of independent variables:  3 
#> Mtry:                             1 
#> Target node size:                 10 
#> Variable importance mode:         none 
#> Splitrule:                        gini 
#> OOB prediction error (Brier s.):  0.0563
```
]

--
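.font90[
A side note (a minimal sketch, assuming the `iris_fit` object above): `parsnip::extract_fit_engine()` peels back the parsnip wrapper and returns the underlying engine object — the same object stored in the `$fit` slot — which is handy when you need engine-specific functionality such as `ranger`'s own methods.
]

```r
# sketch: extract the raw ranger fit from the parsnip model_fit
iris_fit %>% extract_fit_engine() %>% class()  # "ranger"
```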
.pull-right.code90[
```r
# check the class of iris_fit
iris_fit %>% class()
```

```
#> [1] "_ranger"   "model_fit"
```

```r
# inspect the underlying structure of iris_fit
iris_fit %>% str(max.level = 1)
```

```
#> List of 5
#>  $ lvl    : chr [1:3] "setosa" "versicolor" "virginica"
*#>  $ spec   :List of 8
#>   ..- attr(*, "class")= chr [1:2] "rand_forest" "model_spec"
*#>  $ fit    :List of 13
#>   ..- attr(*, "class")= chr "ranger"
#>  $ preproc:List of 4
#>  $ elapsed:List of 1
#>  - attr(*, "class")= chr [1:2] "_ranger" "model_fit"
```
]

---
layout: true

### >> 3.3 Predict

---

.pull-left.code90[
```r
# apply the fitted model to the test set
iris_fit %>% 
  predict(  # parsnip::predict.model_fit()
    new_data = iris_testing
  )
```

```
#> # A tibble: 38 × 1
#>   .pred_class
#>   <fct>      
#> 1 setosa     
#> 2 setosa     
#> 3 setosa     
#> # … with 35 more rows
```

```r
# predict(object, new_data,
#         type = NULL,
#         # "numeric", "class", "prob",
#         # "conf_int", "pred_int",
#         # "quantile",
#         opts = list(), ...)
```
]

--

.pull-right.code95[
```r
# bind the predictions to the data
iris_class <- iris_fit %>% 
  predict(
    new_data = iris_testing
  ) %>% 
*  bind_cols(iris_testing)
iris_class %>% glimpse(width = 40)
```

```
#> Rows: 38
#> Columns: 5
#> $ .pred_class  <fct> setosa, setosa, s…
#> $ Sepal.Length <dbl> -0.9241, -1.5176,…
#> $ Sepal.Width  <dbl> 1.087, 0.847, 0.8…
#> $ Petal.Width  <dbl> -1.359, -1.229, -…
#> $ Species      <fct> setosa, setosa, s…
```
]

---
layout: false
class: inverse, center, middle, hide_logo
background-image: url(imgs/logo-yardstick.svg)
background-size: 4em
background-position: 12% 45%

# 4. Assessing Model Performance -> `yardstick` .font60[<sup>v1.1.0</sup>] .font130[(_Tidy Characterizations of Model Performance_)]

---
layout: true

### >> 4.1 Model performance: `metrics()`

---

.pull-left.code90[
```r
iris_class %>% glimpse(width = 40)
```

```
#> Rows: 38
#> Columns: 5
#> $ .pred_class  <fct> setosa, setosa, s…
#> $ Sepal.Length <dbl> -0.9241, -1.5176,…
#> $ Sepal.Width  <dbl> 1.087, 0.847, 0.8…
#> $ Petal.Width  <dbl> -1.359, -1.229, -…
#> $ Species      <fct> setosa, setosa, s…
```

```r
iris_class %>% 
*  metrics(  # compute performance metrics
*    truth = Species,
*    estimate = .pred_class
*  )
```

```
#> # A tibble: 2 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy multiclass     0.947
#> 2 kap      multiclass     0.920
```
]

--

.pull-right.code85[
```r
# extract per-class predicted probabilities
iris_probs <- iris_fit %>% 
  predict(
    new_data = iris_testing,
*    type = "prob"
  ) %>% 
  bind_cols(iris_class)

# performance metrics based on class probabilities
iris_probs %>% 
  metrics(
    truth = Species,
    estimate = .pred_class,
*    .pred_setosa:.pred_virginica
  )
```

```
#> # A tibble: 4 × 3
#>   .metric     .estimator .estimate
#>   <chr>       <chr>          <dbl>
#> 1 accuracy    multiclass     0.947
#> 2 kap         multiclass     0.920
*#> 3 mn_log_loss multiclass     0.199
*#> 4 roc_auc     hand_till      0.979
```
]

---
layout: true

### >> 4.2 Model performance: `*_curve()`

---

.pull-left.code90[
```r
# ROC (receiver operating characteristic) curve
iris_roc <- iris_probs %>% 
  roc_curve(
    truth = Species,
    .pred_setosa:.pred_virginica
  )
# iris_roc -> a "roc_df"
iris_roc %>% autoplot()
```

<img src="L13_tidymodels_files/figure-html/unnamed-chunk-27-1.png" width="100%" style="display: block; margin: auto;" />
]

--

.pull-right.code90[
```r
# gain curve
iris_gain <- iris_probs %>% 
*  gain_curve(
    truth = Species,
    .pred_setosa:.pred_virginica
  )
# iris_gain  # a "gain_df"
iris_gain %>% autoplot()
```

<img src="L13_tidymodels_files/figure-html/unnamed-chunk-28-1.png" width="100%" style="display: block; margin: auto;" />
]

---
layout: false
class: inverse, center, middle

# 5. A More Complex Example .font150[(_Case Study_)]

---

### >> 5.0 The data

.code90[
```r
library(AmesHousing)
(ames <- make_ames() %>% 
   dplyr::select(-matches("Qu")))
```

```
#> # A tibble: 2,930 × 74
#>   MS_SubClass  MS_Zo…¹ Lot_F…² Lot_A…³ Street Alley Lot_S…⁴ Land_…⁵ Utili…⁶
#>   <fct>        <fct>     <dbl>   <int> <fct>  <fct> <fct>   <fct>   <fct>  
#> 1 One_Story_1… Reside…     141   31770 Pave   No_A… Slight… Lvl     AllPub 
#> 2 One_Story_1… Reside…      80   11622 Pave   No_A… Regular Lvl     AllPub 
#> 3 One_Story_1… Reside…      81   14267 Pave   No_A… Slight… Lvl     AllPub 
#> # … with 2,927 more rows, 65 more variables: Lot_Config <fct>,
#> #   Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>,
#> #   Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>,
#> #   Overall_Cond <fct>, Year_Built <int>, Year_Remod_Add <int>,
#> #   Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>,
#> #   Exterior_2nd <fct>, Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>,
#> #   Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>, …
```
]

---

### >> 5.1 Resampling -> the `rsample` package

.pull-left.code90[
<img src="imgs/resampling.svg" width="100%" style="display: block; margin: auto;" />

```r
set.seed(1234)
data_split <- initial_split(
  ames, strata = "Sale_Price")
ames_train <- training(data_split)
ames_test  <- testing(data_split)
nrow(ames_train) / nrow(ames)
```

```
#> [1] 0.75
```
]

--

.pull-right.code85[
```r
# v-fold cross-validation
(cv_splits <- vfold_cv(ames_train))
```

```
#> #  10-fold cross-validation 
#> # A tibble: 10 × 2
#>   splits             id    
#>   <list>             <chr> 
#> 1 <split [1977/220]> Fold01
#> 2 <split [1977/220]> Fold02
#> 3 <split [1977/220]> Fold03
#> # … with 7 more rows
```

```r
cv_splits$splits[[1]]
```

```
#> <Analysis/Assess/Total>
#> <1977/220/2197>
```

```r
cv_splits$splits[[1]] %>% 
  analysis() %>% dim()
```

```
#> [1] 1977   74
```

`vignette("Common_Patterns", package = "rsample")`
]

---
layout: true

### >> 5.2 Preprocessing -> the `recipes` package

---

```r
# 1. Define the preprocessing recipe
ames_rec <- 
  recipe(Sale_Price ~ Bldg_Type + Neighborhood + Year_Built + 
           Gr_Liv_Area + Full_Bath + Year_Sold + Lot_Area + 
           Central_Air + Longitude + Latitude,
         data = ames_train) %>% 
  step_log(Sale_Price, base = 10) %>% 
  step_BoxCox(Lot_Area, Gr_Liv_Area) %>% 
  step_other(Neighborhood, threshold = 0.05) %>% 
  step_dummy(all_nominal()) %>% 
  step_interact(~ starts_with("Central_Air"):Year_Built) %>% 
  step_bs(Longitude, Latitude, options = list(df = 5))
ames_rec %>% tidy()
```

```
#> # A tibble: 6 × 6
#>   number operation type   trained skip  id          
#>    <int> <chr>     <chr>  <lgl>   <lgl> <chr>       
#> 1      1 step      log    FALSE   FALSE log_YNFHa   
#> 2      2 step      BoxCox FALSE   FALSE BoxCox_vCuOq
#> 3      3 step      other  FALSE   FALSE other_0nogs 
#> # … with 3 more rows
```

---

.pull-left.code85[
```r
# 2. Prep the recipe
cv_splits <- cv_splits %>% 
  mutate(
    recipes = map(splits,
                  prepper,
                  recipe = ames_rec)
  )
cv_splits
```

```
#> #  10-fold cross-validation 
#> # A tibble: 10 × 3
#>   splits             id     recipes 
#>   <list>             <chr>  <list>  
#> 1 <split [1977/220]> Fold01 <recipe>
#> 2 <split [1977/220]> Fold02 <recipe>
#> 3 <split [1977/220]> Fold03 <recipe>
#> # … with 7 more rows
```

```r
# prepper
# function(split_obj, recipe, ...) {
#   prep(recipe,
#        training = rsample::analysis(split_obj),
#        fresh = TRUE, ...)
# }
```
]

--

.pull-right.code95[
```r
# the same recipe +
# different training sets ->
# different prepped results
cv_splits$recipes %>% 
  map_dfr(tidy, number = 2) %>% 
  filter(terms == "Lot_Area")
```

```
#> # A tibble: 10 × 3
#>   terms     value id          
#>   <chr>     <dbl> <chr>       
#> 1 Lot_Area 0.0836 BoxCox_vCuOq
#> 2 Lot_Area 0.0777 BoxCox_vCuOq
#> 3 Lot_Area 0.104  BoxCox_vCuOq
#> # … with 7 more rows
```

```r
# recipes:::tidy.recipe
```
]

---
layout: true

### >> 5.3 Fitting models -> the `parsnip` package

---

--

.pull-left.code85[
```r
# 1. Specify the model
knn_mod <- nearest_neighbor(
  mode = "regression",
  neighbors = 5) %>% 
  set_engine("kknn")  # the default; can be omitted

# 2. Fit the model
# 2.1 define a helper function
parsnip_fit <- function(rec, model) {
  fit(model, Sale_Price ~ .,
      data = bake(rec, new_data = NULL))
}
# 2.2 apply the helper function
cv_splits <- cv_splits %>% 
  mutate(
    knn = map(
      recipes,
      parsnip_fit,
      model = knn_mod
    )
  )
```
]

--

.pull-right.code85[
```r
# inspect the result
cv_splits %>% select(-id)
```

```
#> #  10-fold cross-validation 
#> # A tibble: 10 × 3
#>   splits             recipes  knn     
#>   <list>             <list>   <list>  
#> 1 <split [1977/220]> <recipe> <fit[+]>
#> 2 <split [1977/220]> <recipe> <fit[+]>
#> 3 <split [1977/220]> <recipe> <fit[+]>
#> # … with 7 more rows
```

```r
cv_splits$knn[[1]] %>% 
  str(max.level = 1)
```

```
#> List of 5
#>  $ lvl    : NULL
#>  $ spec   :List of 7
#>   ..- attr(*, "class")= chr [1:2] "nearest_neighbor" "model_spec"
#>  $ fit    :List of 10
#>   ..- attr(*, "class")= chr [1:2] "train.kknn" "kknn"
#>  $ preproc:List of 1
#>  $ elapsed:List of 1
#>  - attr(*, "class")= chr [1:2] "_train.kknn" "model_fit"
```
]

---
layout: true

### >> 5.4 Assessing performance -> the `parsnip` + `yardstick` packages

---

--

.pull-left.code80[
```r
# define the performance metric set
pf_metrics <- metric_set(rmse, rsq, ccc)

# define a helper function
parsnip_metrics <- function(split, recipe, model) {
  raw <- assessment(split)
  processed <- bake(recipe, new_data = raw)
  model %>% 
    predict(new_data = processed) %>% 
    bind_cols(processed) %>% 
    pf_metrics(Sale_Price, .pred)
}

# apply the helper function
cv_splits <- cv_splits %>% 
  mutate(
    metrics = pmap(
      list(
        split = splits,
        recipe = recipes,
        model = knn
      ),
      parsnip_metrics
    )
  )
```
]

--

.pull-right.code85[
```r
cv_splits$metrics[[1]]
```

```
#> # A tibble: 3 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard      0.0839
#> 2 rsq     standard      0.761 
#> 3 ccc     standard      0.871
```

```r
# aggregate the performance metrics
cv_splits %>% 
  unnest(cols = metrics) %>% 
  group_by(.metric) %>% 
  summarise(
    mean = mean(.estimate),
    std_err = sd(.estimate)/sqrt(n())
  )
```

```
#> # A tibble: 3 × 3
#>   .metric   mean std_err
#>   <chr>    <dbl>   <dbl>
#> 1 ccc     0.869  0.00519
#> 2 rmse    0.0870 0.00245
#> 3 rsq     0.764  0.00889
```
]

---
layout: true

### >> 5.5 Tuning: by hand

---

<br>
<img src="imgs/TrainAlgo.svg" width="100%" style="display: block; margin: auto;" />

---

.pull-left.code80[
```r
# define a helper function
m_metrics <- function(split, recipe, fit) {
  holdout <- bake(
    recipe, new_data = assessment(split))
  resample_name <- labels(split)$id
  multi_predict(fit, new_data = holdout,
                neighbors = 1:50) %>% 
    bind_cols(., holdout %>% select(Sale_Price)) %>% 
    unnest(cols = .pred) %>% 
    group_by(neighbors) %>% 
    pf_metrics(Sale_Price, .pred) %>% 
    mutate(resample = resample_name)
}

# apply the helper function
cv_splits <- cv_splits %>% 
  mutate(
    tuning_metrics = pmap(
      list(splits, recipes, knn),
      m_metrics
    )
  )
```
]

--

.pull-right.code85[
```r
cv_splits$tuning_metrics[[1]]
```

```
#> # A tibble: 150 × 5
#>   neighbors .metric .estimator .estimate resample
#>       <int> <chr>   <chr>          <dbl> <chr>   
#> 1         1 rmse    standard      0.101  Fold01  
#> 2         2 rmse    standard      0.0951 Fold01  
#> 3         3 rmse    standard      0.0898 Fold01  
#> # … with 147 more rows
```

```r
# aggregate the performance metrics
(rs_rmse <- cv_splits %>% 
   unnest(tuning_metrics) %>% 
   filter(.metric == "rmse") %>% 
   group_by(neighbors) %>% 
   summarize(rmse = mean(.estimate)))
```

```
#> # A tibble: 50 × 2
#>   neighbors   rmse
#>       <int>  <dbl>
#> 1         1 0.104 
#> 2         2 0.0972
#> 3         3 0.0921
#> # … with 47 more rows
```
]

---

.pull-left.code85[
```r
(min_rmse <- rs_rmse %>% 
   arrange(rmse) %>% 
   slice(1))
```

```
#> # A tibble: 1 × 2
#>   neighbors   rmse
#>       <int>  <dbl>
#> 1        11 0.0837
```

```r
ggplot(rs_rmse,
       aes(x = neighbors, y = rmse)) +
  geom_point() + geom_path() +
  geom_point(data = min_rmse,
             size = 3, color = "red")
```

<img src="L13_tidymodels_files/figure-html/unnamed-chunk-42-1.png" width="80%" style="display: block; margin: auto;" />
]

--

.pull-right.code90[
```r
# the final model
ames_rec_final <- prep(ames_rec)
knn_mod_final <- 
  update(knn_mod, neighbors = 11) %>% 
  fit(Sale_Price ~ .,
      data = bake(ames_rec_final,
                  new_data = NULL))

# evaluate the final model on the test set
parsnip_metrics(data_split,
                ames_rec_final,
                knn_mod_final)
```

```
#> # A tibble: 3 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard      0.0754
#> 2 rsq     standard      0.813 
#> 3 ccc     standard      0.894
```
]

---
layout: true

### >> 5.6 Tuning: `workflows` .font80[<sup>v1.1.2</sup>] + `tune` .font80[<sup>v1.0.1</sup>]

---

.pull-left[
```r
# respecify the model:
# mark the parameter to be tuned
knn_mod <- nearest_neighbor(
  mode = "regression",
*  neighbors = tune()) %>% 
  set_engine("kknn")

# define the workflow
ames_wf <- workflow() %>% 
  add_model(knn_mod) %>% 
  add_recipe(ames_rec)

# compute metrics over a parameter grid
ames_res <- ames_wf %>% 
*  tune_grid(
    cv_splits %>% select(splits, id),
*    grid = tibble(neighbors = 1:50),
    metrics = metric_set(rmse, rsq, ccc)
  )
```
]

--

.pull-right[
```r
ames_res
```

```
#> # Tuning results
#> # 10-fold cross-validation 
#> # A tibble: 10 × 4
#>   splits             id     .metrics           .notes          
#>   <list>             <chr>  <list>             <list>          
#> 1 <split [1977/220]> Fold01 <tibble [150 × 5]> <tibble [0 × 3]>
#> 2 <split [1977/220]> Fold02 <tibble [150 × 5]> <tibble [0 × 3]>
#> 3 <split [1977/220]> Fold03 <tibble [150 × 5]> <tibble [1 × 3]>
#> # … with 7 more rows
#> 
#> There were issues with some computations:
#> 
#>   - Warning(s) x2: some 'x' values beyond boundary knots may cause ill-conditioned bases
#> 
#> Run `show_notes(.Last.tune.result)` for more information.
```

```r
ames_res %>% collect_metrics() %>% 
  glimpse(width = 40)
```

```
#> Rows: 150
#> Columns: 7
#> $ neighbors  <int> 1, 1, 1, 2, 2, 2, 3…
#> $ .metric    <chr> "ccc", "rmse", "rsq…
#> $ .estimator <chr> "standard", "standa…
#> $ mean       <dbl> 0.8276, 0.1042, 0.6…
#> $ n          <int> 10, 10, 10, 10, 10,…
#> $ std_err    <dbl> 0.00943, 0.00302, 0…
#> $ .config    <chr> "Preprocessor1_Mode…
```
]

---

```r
ames_res %>% 
*  collect_metrics() %>% 
  filter(.metric == "rmse") %>% 
  ggplot(aes(x = neighbors, y = mean)) + 
  geom_point() + geom_line() +
  geom_point(size = 2, color = 'red',
             data = . %>% slice_min(mean)) +
  ylab("rmse")
```

<img src="L13_tidymodels_files/figure-html/unnamed-chunk-45-1.png" width="60%" style="display: block; margin: auto;" />

---

```r
(best_knn <- ames_res %>% 
   select_best("rmse"))
```

```
#> # A tibble: 1 × 2
#>   neighbors .config              
#>       <int> <chr>                
#> 1        11 Preprocessor1_Model11
```

```r
ames_fit_test <- ames_wf %>% 
*  finalize_workflow(best_knn) %>% 
*  last_fit(
    split = data_split,
    metrics = metric_set(rmse, rsq, ccc)
  )
ames_fit_test %>% glimpse()
```

```
#> Rows: 1
#> Columns: 6
#> $ splits       <list> [<initial_split[2197 x 733 x 2930 x 74]>]
#> $ id           <chr> "train/test split"
#> $ .metrics     <list> [<tbl_df[3 x 4]>]
#> $ .notes       <list> [<tbl_df[1 x 3]>]
#> $ .predictions <list> [<tbl_df[733 x 4]>]
#> $ .workflow    <list> [Bldg_Type, Neighborhood, Year_Built, Gr_Liv_Area, Full_Bath, Year…
```

---

.pull-left[
```r
ames_fit_test %>% collect_metrics()
```

```
#> # A tibble: 3 × 4
#>   .metric .estimator .estimate .config  
#>   <chr>   <chr>          <dbl> <chr>    
#> 1 rmse    standard      0.0754 Preproce…
#> 2 rsq     standard      0.813  Preproce…
#> 3 ccc     standard      0.894  Preproce…
```
]

--

.pull-right[
```r
ames_fit_test %>% 
  collect_predictions() %>% 
  ggplot(aes(Sale_Price, .pred)) + 
  geom_abline(lty = 2, size = 1.5,
              color = "gray50") + 
  geom_point(alpha = 0.5) + 
  coord_fixed()
```

<img src="L13_tidymodels_files/figure-html/unnamed-chunk-49-1.png" width="100%" />
]

---
layout: false

### >> The `tidymodels` family!

<img src="imgs/tidymodels-core.png" width="80%" style="display: block; margin: auto auto auto 0;" />

--

<img src="imgs/tidymodels-more.png" width="100%" style="display: block; margin: auto;" />

---
layout: false
class: inverse, center, middle

# Homework

---
class: middle

.font150[
1. Type and run the code from Lecture 13 in an 📑 _Rmd_ file (on your own; no need to hand it in)
1. Visit the .red[[{{`tidymodels`}}](https://www.tidymodels.org/)] homepage 🏠 and browse the learning materials there
1. Three books 📖 written by the `tidymodels` core team, for those who want to dig deeper:
  - [Tidy Modeling with R](https://www.tmwr.org/)
  - [Feature Engineering and Selection](https://bookdown.org/max/FES/)
  - [Supervised Machine Learning for Text Analysis in R](https://smltar.com/)
]

---
class: center, middle, hide_logo
background-image: url(imgs/xaringan.png)
background-size: 12%
background-position: 50% 40%

<br><br><br><br><br><br><br>
<hr color='#f00' size='2px' width='80%'>
<br>

.Large.red[_**These HTML slides are powered by the R package [{{`xaringan`}}](https://github.com/yihui/xaringan)!**_]