class: center, middle, inverse, title-slide

.title[
# Quantitative Finance and Financial Programming
]
.subtitle[
## L13 The `tidymodels` Suite
]
.author[
### 曾永艺 (Zeng Yongyi)
]
.institute[
### School of Management, Xiamen University
]
.date[
### 2022-12-16
]

---
class: center, middle
<img src="imgs/tidyverse.png" width="85%" style="display: block; margin: auto;" />

---

### >> Why do statistical modeling in R?

--

<br>

.pull-left.font150.bold[
💪 R and most R packages are developed by statisticians and data-analysis experts

💪 R has an exceptionally rich and up-to-date collection of [.red[{{statistical modeling and machine learning packages}}]](https://cran.r-project.org/web/views/)

💪 R interfaces easily with other software and languages (e.g. C++, Python, Spark, TensorFlow)
]

--

.pull-right[
<img src="imgs/MLwR.png" width="80%" style="display: block; margin: auto;" />
]

---

### >> What challenges does statistical modeling in R face?
--

<br>

.pull-left.font150.bold[
😰 As an interpreted data-analysis language, R runs slower than compiled languages such as C, C++, and Java

😰 The size of the data R can work on is often limited by the computer's memory

.red[😰 The interfaces of the various R packages are far from uniform]
]

--

.pull-right.code80[
|Function     |Package      |Code                                         |
|:------------|:------------|:--------------------------------------------|
|`lda`        |`MASS`       |`predict(obj)`                               |
|`glm`        |`stats`      |`predict(obj, type = "response")`            |
|`gbm`        |`gbm`        |`predict(obj, type = "response", n.trees)`   |
|`rpart`      |`rpart`      |`predict(obj, type = "prob")`                |
|`Weka`       |`RWeka`      |`predict(obj, type = "probability")`         |
]

--

<br>

.right.bold.font120[
[`caret`](https://github.com/topepo/caret/) -> .font120[👉 .red[[`tidymodels`!](https://www.tidymodels.org/)]]<br>
[`mlr`](https://mlr.mlr-org.com/) -> [`mlr3`!](https://mlr3.mlr-org.com/)
]

---
layout: false

<img src="imgs/ds.png" width="40%" style="display: block; margin: auto auto auto 0;" />
<hr>
<img src="imgs/tidymodels-flow.svg" width="100%" style="display: block; margin: auto;" />

---
class: inverse, center, middle, hide_logo
background-image: url(imgs/logo-rsample.svg)
background-size: 4em
background-position: 12% 45%

# 1. Resampling Data -> `rsample` .font60[<sup>v1.1.1</sup>] .font150[(_General Resampling Infrastructure_)]

---
layout: true

### >> Resampling data -> the `rsample` package .font80[<sup>v1.1.1</sup>]

---

.code90[
```r
# First install and load tidymodels — one package that loads the whole suite
library(tidymodels)  # v1.0.0
```

```
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
```

```
#> ✔ broom        1.0.1     ✔ recipes      1.0.3
#> ✔ dials        1.1.0     ✔ rsample      1.1.1
#> ✔ dplyr        1.0.10    ✔ tibble       3.1.8
#> ✔ ggplot2      3.4.0     ✔ tidyr        1.2.1
#> ✔ infer        1.0.4     ✔ tune         1.0.1
#> ✔ modeldata    1.0.1     ✔ workflows    1.1.2
#> ✔ parsnip      1.0.3     ✔ workflowsets 1.0.0
#> ✔ purrr        0.3.5     ✔ yardstick    1.1.0
```

```
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Use suppressPackageStartupMessages() to eliminate package startup messages
```
]

---

.pull-left.code95[
```r
# Set the random seed
set.seed(1234)
# Randomly split into training and test sets
iris_split <- initial_split(iris)
# initial_split(data, prop = 3/4,
#               strata = NULL,
#               breaks = 4,
#               pool = 0.1, ...)
```

```r
# Inspect the split
iris_split
```

```
#> <Training/Testing/Total>
#> <112/38/150>
```

```r
# Check the class of iris_split
iris_split %>% class()
```

```
#> [1] "initial_split" "mc_split"     
#> [3] "rsplit"
```
]

--

.pull-right.code85[
```r
# Inspect the underlying structure of the rsplit object
iris_split %>% str(vec.len = 1.5)
```

```
*#> List of 4
*#>  $ data  :'data.frame': 150 obs. of 5 variables:
#>   ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 ...
#>   ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 ...
#>   ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 ...
#>   ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 ...
#>   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 ...
*#>  $ in_id : int [1:112] 28 80 101 111 ...
#>  $ out_id: logi NA
#>  $ id    : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
#>   ..$ id: chr "Resample1"
#>  - attr(*, "class")= chr [1:3] "initial_split" "mc_split" ...
```

```r
# Extract the training set
# (Q: which function extracts the test set?)
iris_split %>% 
  training() %>% 
  glimpse(width = 40)
```

```
#> Rows: 112
#> Columns: 5
#> $ Sepal.Length <dbl> 5.2, 5.7, 6.3, 6.…
#> $ Sepal.Width  <dbl> 3.5, 2.6, 3.3, 3.…
#> $ Petal.Length <dbl> 1.5, 3.5, 6.0, 5.…
#> $ Petal.Width  <dbl> 0.2, 1.0, 2.5, 2.…
#> $ Species      <fct> setosa, versicolo…
```
]

---
layout: false
class: inverse, center, middle, hide_logo
background-image: url(imgs/logo-recipes.svg)
background-size: 4em
background-position: 12% 45%

# 2. Preprocessing Data -> `recipes` .font60[<sup>v1.0.3</sup>] .font120[(_Preprocessing and Feature Engineering Steps for Modeling_)]

---

### >> 2.1 Define the recipe

--

.pull-left.code85[
```r
iris_recipe <- 
  recipe(Species ~ ., data = iris) %>% 
  step_corr(all_predictors()) %>% 
  step_center(all_predictors()) %>% 
  step_scale(all_predictors())
iris_recipe  # a complex recipe object — View() it!
```

```
#> Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor          4
#> 
#> Operations:
#> 
#> Correlation filter on all_predictors()
#> Centering for all_predictors()
#> Scaling for all_predictors()
```
]

--

.pull-right.font100[
.bold.info[The `recipes` package provides 97 `step_*()` preprocessing functions in total, for example:]
.code80[
```
[1] "step_arrange"    "step_bagimpute" 
[3] "step_bin2factor" "step_BoxCox"    
[5] "..."             "..."            
[7] "step_unorder"    "step_window"    
[9] "step_YeoJohnson" "step_zv"
```
]

.bold.info[You can specify the variables a `step_*()` acts on in five ways (and combinations thereof):]

- enumerating the relevant variables by name
- `dplyr`-style selectors, such as `starts_with()`
- by role, via `all_outcomes()`, `all_predictors()`, or `has_role()`
- by type, via `all_numeric()`, `all_nominal()`, or `has_type()`
- by combined type + role, e.g. `all_numeric_predictors()`
]

---

### >> 2.2 Prep the recipe

--

.pull-left.code90[
```r
iris_recipe <- iris_recipe %>% 
  prep(training(iris_split),
       verbose = TRUE)
```

```
#> oper 1 step corr [training] 
#> oper 2 step center [training] 
#> oper 3 step scale [training] 
#> The retained training set is ~ 0 Mb in memory.
```

```r
# prep(x,
#      training = NULL,
#      fresh = FALSE,
#      verbose = FALSE,
#      retain = TRUE,
#      log_change = FALSE,
#      strings_as_factors = TRUE,
#      ...)
```
]

--

.pull-right.code90[
```r
iris_recipe
```

```
#> Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor          4
#> 
#> Training data contained 112 data points and no missing data.
#> 
#> Operations:
#> 
*#> Correlation filter on Petal.Length [trained]
*#> Centering for Sepal.Length, Sepal.Width, Petal.Width [trained]
*#> Scaling for Sepal.Length, Sepal.Width, Petal.Width [trained]
```
]

---

### >> 2.3 Bake the recipe

--

.pull-left.code90[
```r
# Training set:
# the data retained at the prep
# stage can be extracted directly
iris_training <- iris_recipe %>% 
*  bake(new_data = NULL)  # juice()
# bake(object,
#      new_data,
#      ...,
#      composition = "tibble")
iris_training
```

```
#> # A tibble: 112 × 4
#>   Sepal.Length Sepal.Width Petal.Width Species   
#>          <dbl>       <dbl>       <dbl> <fct>     
#> 1       -0.805       1.09       -1.36  setosa    
#> 2       -0.212      -1.07       -0.316 versicolor
#> 3        0.500       0.607       1.64  virginica 
#> # … with 109 more rows
```
]

--

.pull-right.code85[
```r
# Test set:
# preprocess it with bake()
iris_testing <- iris_recipe %>% 
*  bake(new_data = testing(iris_split))
iris_testing
```

```
#> # A tibble: 38 × 4
#>   Sepal.Length Sepal.Width Petal.Width Species
#>          <dbl>       <dbl>       <dbl> <fct>  
#> 1       -0.924       1.09        -1.36 setosa 
#> 2       -1.52        0.847       -1.23 setosa 
#> 3       -1.28        0.847       -1.36 setosa 
#> # … with 35 more rows
```
]

---
layout: false
class: inverse, center, middle, hide_logo
background-image: url(imgs/logo-parsnip.svg)
background-size: 4em
background-position: 12% 45%

# 3. Training Models -> `parsnip` .font60[<sup>v1.0.3</sup>] .font130[(_A Common API to Modeling and Analysis Functions_)]

---
layout: true

### >> 3.1 Specify the model

---

--

.pull-left.code90[
```r
# a unified API for random forest models
iris_rf <- rand_forest(
  mode = "classification",
  trees = 100
)
# rand_forest(
#   mode = "unknown",
#   engine = "ranger",
#   mtry = NULL, trees = NULL,
#   min_n = NULL)

# set the computational engine for the random forest
# show_engines("rand_forest")
(iris_rf <- iris_rf %>% 
   set_engine("ranger"))
```

```
#> Random Forest Model Specification (classification)
#> 
#> Main Arguments:
#>   trees = 100
#> 
#> Computational engine: ranger
```
]

--

.pull-right.code85[
```r
# check the class of iris_rf
iris_rf %>% class()
```

```
#> [1] "rand_forest" "model_spec"
```

```r
# inspect the underlying structure of iris_rf
iris_rf %>% str()
```

```
#> List of 7
#>  $ args                 :List of 3
#>   ..$ mtry : language ~NULL
#>   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
#>   ..$ trees: language ~100
#>   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
#>   ..$ min_n: language ~NULL
#>   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
#>  $ eng_args             : Named list()
#>   ..- attr(*, "class")= chr [1:2] "quosures" "list"
#>  $ mode                 : chr "classification"
#>  $ user_specified_mode  : logi TRUE
#>  $ method               : NULL
#>  $ engine               : chr "ranger"
#>  $ user_specified_engine: logi TRUE
#>  - attr(*, "class")= chr [1:2] "rand_forest" "model_spec"
```
]

---
layout: true

### >> 3.2 Fit the model

---

.pull-left.code85[
```r
# fit the random forest model
(iris_fit <- iris_rf %>% 
   fit(  # parsnip::fit.model_spec()
     formula = Species ~ .,
     data = iris_training
   ))
```

```
#> parsnip model object
#> 
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~100, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) 
#> 
#> Type:                             Probability estimation 
#> Number of trees:                  100 
#> Sample size:                      112 
#> Number of independent variables:  3 
#> Mtry:                             1 
#> Target node size:                 10 
#> Variable importance mode:         none 
#> Splitrule:                        gini 
#> OOB prediction error (Brier s.):  0.0563
```
]

--
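.font90[
A side note (a minimal sketch, assuming the `iris_fit` object above): `parsnip::extract_fit_engine()` peels back the parsnip wrapper and returns the underlying engine object — the same object stored in the `$fit` slot — which is handy when you need engine-specific functionality such as `ranger`'s own methods.
]

```r
# sketch: extract the raw ranger fit from the parsnip model_fit
iris_fit %>% extract_fit_engine() %>% class()  # "ranger"
```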
.pull-right.code90[
```r
# check the class of iris_fit
iris_fit %>% class()
```

```
#> [1] "_ranger"   "model_fit"
```

```r
# inspect the underlying structure of iris_fit
iris_fit %>% str(max.level = 1)
```

```
#> List of 5
#>  $ lvl    : chr [1:3] "setosa" "versicolor" "virginica"
*#>  $ spec   :List of 8
#>   ..- attr(*, "class")= chr [1:2] "rand_forest" "model_spec"
*#>  $ fit    :List of 13
#>   ..- attr(*, "class")= chr "ranger"
#>  $ preproc:List of 4
#>  $ elapsed:List of 1
#>  - attr(*, "class")= chr [1:2] "_ranger" "model_fit"
```
]

---
layout: true

### >> 3.3 Predict

---

.pull-left.code90[
```r
# apply the fitted model to the test set
iris_fit %>% 
  predict(  # parsnip::predict.model_fit()
    new_data = iris_testing
  )
```

```
#> # A tibble: 38 × 1
#>   .pred_class
#>   <fct>      
#> 1 setosa     
#> 2 setosa     
#> 3 setosa     
#> # … with 35 more rows
```

```r
# predict(object, new_data,
#         type = NULL,
#         # "numeric", "class", "prob",
#         # "conf_int", "pred_int",
#         # "quantile",
#         opts = list(), ...)
```
]

--

.pull-right.code95[
```r
# bind the predictions to the data
iris_class <- iris_fit %>% 
  predict(
    new_data = iris_testing
  ) %>% 
*  bind_cols(iris_testing)
iris_class %>% glimpse(width = 40)
```

```
#> Rows: 38
#> Columns: 5
#> $ .pred_class  <fct> setosa, setosa, s…
#> $ Sepal.Length <dbl> -0.9241, -1.5176,…
#> $ Sepal.Width  <dbl> 1.087, 0.847, 0.8…
#> $ Petal.Width  <dbl> -1.359, -1.229, -…
#> $ Species      <fct> setosa, setosa, s…
```
]

---
layout: false
class: inverse, center, middle, hide_logo
background-image: url(imgs/logo-yardstick.svg)
background-size: 4em
background-position: 12% 45%

# 4. Assessing Model Performance -> `yardstick` .font60[<sup>v1.1.0</sup>] .font130[(_Tidy Characterizations of Model Performance_)]

---
layout: true

### >> 4.1 Model performance: `metrics()`

---

.pull-left.code90[
```r
iris_class %>% glimpse(width = 40)
```

```
#> Rows: 38
#> Columns: 5
#> $ .pred_class  <fct> setosa, setosa, s…
#> $ Sepal.Length <dbl> -0.9241, -1.5176,…
#> $ Sepal.Width  <dbl> 1.087, 0.847, 0.8…
#> $ Petal.Width  <dbl> -1.359, -1.229, -…
#> $ Species      <fct> setosa, setosa, s…
```

```r
iris_class %>% 
*  metrics(  # compute performance metrics
*    truth = Species,
*    estimate = .pred_class
*  )
```

```
#> # A tibble: 2 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy multiclass     0.947
#> 2 kap      multiclass     0.920
```
]

--

.pull-right.code85[
```r
# extract per-class predicted probabilities
iris_probs <- iris_fit %>% 
  predict(
    new_data = iris_testing,
*    type = "prob"
  ) %>% 
  bind_cols(iris_class)

# performance metrics based on class probabilities
iris_probs %>% 
  metrics(
    truth = Species,
    estimate = .pred_class,
*    .pred_setosa:.pred_virginica
  )
```

```
#> # A tibble: 4 × 3
#>   .metric     .estimator .estimate
#>   <chr>       <chr>          <dbl>
#> 1 accuracy    multiclass     0.947
#> 2 kap         multiclass     0.920
*#> 3 mn_log_loss multiclass     0.199
*#> 4 roc_auc     hand_till      0.979
```
]

---
layout: true

### >> 4.2 Model performance: `*_curve()`

---

.pull-left.code90[
```r
# ROC (receiver operating characteristic) curve
iris_roc <- iris_probs %>% 
  roc_curve(
    truth = Species,
    .pred_setosa:.pred_virginica
  )
# iris_roc -> a "roc_df"
iris_roc %>% autoplot()
```

<img src="L13_tidymodels_files/figure-html/unnamed-chunk-27-1.png" width="100%" style="display: block; margin: auto;" />
]

--

.pull-right.code90[
```r
# gain curve
iris_gain <- iris_probs %>% 
*  gain_curve(
    truth = Species,
    .pred_setosa:.pred_virginica
  )
# iris_gain  # a "gain_df"
iris_gain %>% autoplot()
```

<img src="L13_tidymodels_files/figure-html/unnamed-chunk-28-1.png" width="100%" style="display: block; margin: auto;" />
]

---
layout: false
class: inverse, center, middle

# 5. A More Complex Example .font150[(_Case Study_)]

---

### >> 5.0 The data

.code90[
```r
library(AmesHousing)
(ames <- make_ames() %>% 
   dplyr::select(-matches("Qu")))
```

```
#> # A tibble: 2,930 × 74
#>   MS_SubClass  MS_Zo…¹ Lot_F…² Lot_A…³ Street Alley Lot_S…⁴ Land_…⁵ Utili…⁶
#>   <fct>        <fct>     <dbl>   <int> <fct>  <fct> <fct>   <fct>   <fct>  
#> 1 One_Story_1… Reside…     141   31770 Pave   No_A… Slight… Lvl     AllPub 
#> 2 One_Story_1… Reside…      80   11622 Pave   No_A… Regular Lvl     AllPub 
#> 3 One_Story_1… Reside…      81   14267 Pave   No_A… Slight… Lvl     AllPub 
#> # … with 2,927 more rows, 65 more variables: Lot_Config <fct>,
#> #   Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>,
#> #   Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>,
#> #   Overall_Cond <fct>, Year_Built <int>, Year_Remod_Add <int>,
#> #   Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>,
#> #   Exterior_2nd <fct>, Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>,
#> #   Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>, …
```
]

---

### >> 5.1 Resampling -> the `rsample` package

.pull-left.code90[
<img src="imgs/resampling.svg" width="100%" style="display: block; margin: auto;" />

```r
set.seed(1234)
data_split <- initial_split(
  ames, strata = "Sale_Price")
ames_train <- training(data_split)
ames_test  <- testing(data_split)
nrow(ames_train) / nrow(ames)
```

```
#> [1] 0.75
```
]

--

.pull-right.code85[
```r
# v-fold cross-validation
(cv_splits <- vfold_cv(ames_train))
```

```
#> #  10-fold cross-validation 
#> # A tibble: 10 × 2
#>   splits             id    
#>   <list>             <chr> 
#> 1 <split [1977/220]> Fold01
#> 2 <split [1977/220]> Fold02
#> 3 <split [1977/220]> Fold03
#> # … with 7 more rows
```

```r
cv_splits$splits[[1]]
```

```
#> <Analysis/Assess/Total>
#> <1977/220/2197>
```

```r
cv_splits$splits[[1]] %>% 
  analysis() %>% dim()
```

```
#> [1] 1977   74
```

`vignette("Common_Patterns", package = "rsample")`
]

---
layout: true

### >> 5.2 Preprocessing -> the `recipes` package

---

```r
# 1. Define the preprocessing recipe
ames_rec <- 
  recipe(Sale_Price ~ Bldg_Type + Neighborhood + Year_Built + 
           Gr_Liv_Area + Full_Bath + Year_Sold + Lot_Area + 
           Central_Air + Longitude + Latitude,
         data = ames_train) %>% 
  step_log(Sale_Price, base = 10) %>% 
  step_BoxCox(Lot_Area, Gr_Liv_Area) %>% 
  step_other(Neighborhood, threshold = 0.05) %>% 
  step_dummy(all_nominal()) %>% 
  step_interact(~ starts_with("Central_Air"):Year_Built) %>% 
  step_bs(Longitude, Latitude, options = list(df = 5))
ames_rec %>% tidy()
```

```
#> # A tibble: 6 × 6
#>   number operation type   trained skip  id          
#>    <int> <chr>     <chr>  <lgl>   <lgl> <chr>       
#> 1      1 step      log    FALSE   FALSE log_YNFHa   
#> 2      2 step      BoxCox FALSE   FALSE BoxCox_vCuOq
#> 3      3 step      other  FALSE   FALSE other_0nogs 
#> # … with 3 more rows
```

---

.pull-left.code85[
```r
# 2. Prep the recipe
cv_splits <- cv_splits %>% 
  mutate(
    recipes = map(splits,
                  prepper,
                  recipe = ames_rec)
  )
cv_splits
```

```
#> #  10-fold cross-validation 
#> # A tibble: 10 × 3
#>   splits             id     recipes 
#>   <list>             <chr>  <list>  
#> 1 <split [1977/220]> Fold01 <recipe>
#> 2 <split [1977/220]> Fold02 <recipe>
#> 3 <split [1977/220]> Fold03 <recipe>
#> # … with 7 more rows
```

```r
# prepper
# function(split_obj, recipe, ...) {
#   prep(recipe,
#        training = rsample::analysis(split_obj),
#        fresh = TRUE, ...)
# }
```
]

--

.pull-right.code95[
```r
# the same recipe +
# different training sets ->
# different prepped results
cv_splits$recipes %>% 
  map_dfr(tidy, number = 2) %>% 
  filter(terms == "Lot_Area")
```

```
#> # A tibble: 10 × 3
#>   terms     value id          
#>   <chr>     <dbl> <chr>       
#> 1 Lot_Area 0.0836 BoxCox_vCuOq
#> 2 Lot_Area 0.0777 BoxCox_vCuOq
#> 3 Lot_Area 0.104  BoxCox_vCuOq
#> # … with 7 more rows
```

```r
# recipes:::tidy.recipe
```
]

---
layout: true

### >> 5.3 Fitting models -> the `parsnip` package

---

--

.pull-left.code85[
```r
# 1. Specify the model
knn_mod <- nearest_neighbor(
  mode = "regression",
  neighbors = 5) %>% 
  set_engine("kknn")  # the default; can be omitted

# 2. Fit the model
# 2.1 define a helper function
parsnip_fit <- function(rec, model) {
  fit(model, Sale_Price ~ .,
      data = bake(rec, new_data = NULL))
}
# 2.2 apply the helper function
cv_splits <- cv_splits %>% 
  mutate(
    knn = map(
      recipes,
      parsnip_fit,
      model = knn_mod
    )
  )
```
]

--

.pull-right.code85[
```r
# inspect the result
cv_splits %>% select(-id)
```

```
#> #  10-fold cross-validation 
#> # A tibble: 10 × 3
#>   splits             recipes  knn     
#>   <list>             <list>   <list>  
#> 1 <split [1977/220]> <recipe> <fit[+]>
#> 2 <split [1977/220]> <recipe> <fit[+]>
#> 3 <split [1977/220]> <recipe> <fit[+]>
#> # … with 7 more rows
```

```r
cv_splits$knn[[1]] %>% 
  str(max.level = 1)
```

```
#> List of 5
#>  $ lvl    : NULL
#>  $ spec   :List of 7
#>   ..- attr(*, "class")= chr [1:2] "nearest_neighbor" "model_spec"
#>  $ fit    :List of 10
#>   ..- attr(*, "class")= chr [1:2] "train.kknn" "kknn"
#>  $ preproc:List of 1
#>  $ elapsed:List of 1
#>  - attr(*, "class")= chr [1:2] "_train.kknn" "model_fit"
```
]

---
layout: true

### >> 5.4 Assessing performance -> the `parsnip` + `yardstick` packages

---

--

.pull-left.code80[
```r
# define the performance metric set
pf_metrics <- metric_set(rmse, rsq, ccc)

# define a helper function
parsnip_metrics <- function(split, recipe, model) {
  raw <- assessment(split)
  processed <- bake(recipe, new_data = raw)
  model %>% 
    predict(new_data = processed) %>% 
    bind_cols(processed) %>% 
    pf_metrics(Sale_Price, .pred)
}

# apply the helper function
cv_splits <- cv_splits %>% 
  mutate(
    metrics = pmap(
      list(
        split = splits,
        recipe = recipes,
        model = knn
      ),
      parsnip_metrics
    )
  )
```
]

--

.pull-right.code85[
```r
cv_splits$metrics[[1]]
```

```
#> # A tibble: 3 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard      0.0839
#> 2 rsq     standard      0.761 
#> 3 ccc     standard      0.871
```

```r
# aggregate the performance metrics
cv_splits %>% 
  unnest(cols = metrics) %>% 
  group_by(.metric) %>% 
  summarise(
    mean = mean(.estimate),
    std_err = sd(.estimate)/sqrt(n())
  )
```

```
#> # A tibble: 3 × 3
#>   .metric   mean std_err
#>   <chr>    <dbl>   <dbl>
#> 1 ccc     0.869  0.00519
#> 2 rmse    0.0870 0.00245
#> 3 rsq     0.764  0.00889
```
]

---
layout: true

### >> 5.5 Tuning: by hand

---

<br>
<img src="imgs/TrainAlgo.svg" width="100%" style="display: block; margin: auto;" />

---

.pull-left.code80[
```r
# define a helper function
m_metrics <- function(split, recipe, fit) {
  holdout <- bake(
    recipe, new_data = assessment(split))
  resample_name <- labels(split)$id
  multi_predict(fit, new_data = holdout,
                neighbors = 1:50) %>% 
    bind_cols(., holdout %>% select(Sale_Price)) %>% 
    unnest(cols = .pred) %>% 
    group_by(neighbors) %>% 
    pf_metrics(Sale_Price, .pred) %>% 
    mutate(resample = resample_name)
}

# apply the helper function
cv_splits <- cv_splits %>% 
  mutate(
    tuning_metrics = pmap(
      list(splits, recipes, knn),
      m_metrics
    )
  )
```
]

--

.pull-right.code85[
```r
cv_splits$tuning_metrics[[1]]
```

```
#> # A tibble: 150 × 5
#>   neighbors .metric .estimator .estimate resample
#>       <int> <chr>   <chr>          <dbl> <chr>   
#> 1         1 rmse    standard      0.101  Fold01  
#> 2         2 rmse    standard      0.0951 Fold01  
#> 3         3 rmse    standard      0.0898 Fold01  
#> # … with 147 more rows
```

```r
# aggregate the performance metrics
(rs_rmse <- cv_splits %>% 
   unnest(tuning_metrics) %>% 
   filter(.metric == "rmse") %>% 
   group_by(neighbors) %>% 
   summarize(rmse = mean(.estimate)))
```

```
#> # A tibble: 50 × 2
#>   neighbors   rmse
#>       <int>  <dbl>
#> 1         1 0.104 
#> 2         2 0.0972
#> 3         3 0.0921
#> # … with 47 more rows
```
]

---

.pull-left.code85[
```r
(min_rmse <- rs_rmse %>% 
   arrange(rmse) %>% 
   slice(1))
```

```
#> # A tibble: 1 × 2
#>   neighbors   rmse
#>       <int>  <dbl>
#> 1        11 0.0837
```

```r
ggplot(rs_rmse,
       aes(x = neighbors, y = rmse)) +
  geom_point() + geom_path() +
  geom_point(data = min_rmse,
             size = 3, color = "red")
```

<img src="L13_tidymodels_files/figure-html/unnamed-chunk-42-1.png" width="80%" style="display: block; margin: auto;" />
]

--

.pull-right.code90[
```r
# the final model
ames_rec_final <- prep(ames_rec)
knn_mod_final <- 
  update(knn_mod, neighbors = 11) %>% 
  fit(Sale_Price ~ .,
      data = bake(ames_rec_final,
                  new_data = NULL))

# evaluate the final model on the test set
parsnip_metrics(data_split,
                ames_rec_final,
                knn_mod_final)
```

```
#> # A tibble: 3 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard      0.0754
#> 2 rsq     standard      0.813 
#> 3 ccc     standard      0.894
```
]

---
layout: true

### >> 5.6 Tuning: `workflows` .font80[<sup>v1.1.2</sup>] + `tune` .font80[<sup>v1.0.1</sup>]

---

.pull-left[
```r
# respecify the model:
# mark the parameter to be tuned
knn_mod <- nearest_neighbor(
  mode = "regression",
*  neighbors = tune()) %>% 
  set_engine("kknn")

# define the workflow
ames_wf <- workflow() %>% 
  add_model(knn_mod) %>% 
  add_recipe(ames_rec)

# compute metrics over a parameter grid
ames_res <- ames_wf %>% 
*  tune_grid(
    cv_splits %>% select(splits, id),
*    grid = tibble(neighbors = 1:50),
    metrics = metric_set(rmse, rsq, ccc)
  )
```
]

--

.pull-right[
```r
ames_res
```

```
#> # Tuning results
#> # 10-fold cross-validation 
#> # A tibble: 10 × 4
#>   splits             id     .metrics           .notes          
#>   <list>             <chr>  <list>             <list>          
#> 1 <split [1977/220]> Fold01 <tibble [150 × 5]> <tibble [0 × 3]>
#> 2 <split [1977/220]> Fold02 <tibble [150 × 5]> <tibble [0 × 3]>
#> 3 <split [1977/220]> Fold03 <tibble [150 × 5]> <tibble [1 × 3]>
#> # … with 7 more rows
#> 
#> There were issues with some computations:
#> 
#>   - Warning(s) x2: some 'x' values beyond boundary knots may cause ill-conditioned bases
#> 
#> Run `show_notes(.Last.tune.result)` for more information.
```

```r
ames_res %>% collect_metrics() %>% 
  glimpse(width = 40)
```

```
#> Rows: 150
#> Columns: 7
#> $ neighbors  <int> 1, 1, 1, 2, 2, 2, 3…
#> $ .metric    <chr> "ccc", "rmse", "rsq…
#> $ .estimator <chr> "standard", "standa…
#> $ mean       <dbl> 0.8276, 0.1042, 0.6…
#> $ n          <int> 10, 10, 10, 10, 10,…
#> $ std_err    <dbl> 0.00943, 0.00302, 0…
#> $ .config    <chr> "Preprocessor1_Mode…
```
]

---

```r
ames_res %>% 
*  collect_metrics() %>% 
  filter(.metric == "rmse") %>% 
  ggplot(aes(x = neighbors, y = mean)) + 
  geom_point() + geom_line() +
  geom_point(size = 2, color = 'red',
             data = . %>% slice_min(mean)) +
  ylab("rmse")
```

<img src="L13_tidymodels_files/figure-html/unnamed-chunk-45-1.png" width="60%" style="display: block; margin: auto;" />

---

```r
(best_knn <- ames_res %>% 
   select_best("rmse"))
```

```
#> # A tibble: 1 × 2
#>   neighbors .config              
#>       <int> <chr>                
#> 1        11 Preprocessor1_Model11
```

```r
ames_fit_test <- ames_wf %>% 
*  finalize_workflow(best_knn) %>% 
*  last_fit(
    split = data_split,
    metrics = metric_set(rmse, rsq, ccc)
  )
ames_fit_test %>% glimpse()
```

```
#> Rows: 1
#> Columns: 6
#> $ splits       <list> [<initial_split[2197 x 733 x 2930 x 74]>]
#> $ id           <chr> "train/test split"
#> $ .metrics     <list> [<tbl_df[3 x 4]>]
#> $ .notes       <list> [<tbl_df[1 x 3]>]
#> $ .predictions <list> [<tbl_df[733 x 4]>]
#> $ .workflow    <list> [Bldg_Type, Neighborhood, Year_Built, Gr_Liv_Area, Full_Bath, Year…
```

---

.pull-left[
```r
ames_fit_test %>% collect_metrics()
```

```
#> # A tibble: 3 × 4
#>   .metric .estimator .estimate .config  
#>   <chr>   <chr>          <dbl> <chr>    
#> 1 rmse    standard      0.0754 Preproce…
#> 2 rsq     standard      0.813  Preproce…
#> 3 ccc     standard      0.894  Preproce…
```
]

--

.pull-right[
```r
ames_fit_test %>% 
  collect_predictions() %>% 
  ggplot(aes(Sale_Price, .pred)) + 
  geom_abline(lty = 2, size = 1.5,
              color = "gray50") + 
  geom_point(alpha = 0.5) + 
  coord_fixed()
```

<img src="L13_tidymodels_files/figure-html/unnamed-chunk-49-1.png" width="100%" />
]

---
layout: false

### >> The `tidymodels` family!

<img src="imgs/tidymodels-core.png" width="80%" style="display: block; margin: auto auto auto 0;" />

--

<img src="imgs/tidymodels-more.png" width="100%" style="display: block; margin: auto;" />

---
layout: false
class: inverse, center, middle

# Homework

---
class: middle

.font150[
1. Type and run the code from Lecture 13 in an 📑 _Rmd_ file (on your own; no need to hand it in)
1. Visit the .red[[{{`tidymodels`}}](https://www.tidymodels.org/)] homepage 🏠 and browse the learning materials there
1. Three books 📖 written by the `tidymodels` core team, for those who want to dig deeper:
  - [Tidy Modeling with R](https://www.tmwr.org/)
  - [Feature Engineering and Selection](https://bookdown.org/max/FES/)
  - [Supervised Machine Learning for Text Analysis in R](https://smltar.com/)
]

---
class: center, middle, hide_logo
background-image: url(imgs/xaringan.png)
background-size: 12%
background-position: 50% 40%

<br><br><br><br><br><br><br>
<hr color='#f00' size='2px' width='80%'>
<br>

.Large.red[_**These HTML slides are powered by the R package [{{`xaringan`}}](https://github.com/yihui/xaringan)!**_]