第8讲字符串、因子和日期-时间

class: center, middle, inverse, title-slide

.title[
# 量化金融与金融编程
]
.subtitle[
## <br>L08 字符串、因子和日期-时间
]
.author[
### <br>曾永艺
]
.institute[
### 厦门大学管理学院
]
.date[
### <br>2022-11-18
]

---

class: middle
background-image: url(imgs/s-f-l.png)
background-size: 15em
background-position: 90% 85%

<div>
<style type="text/css">.xaringan-extra-logo {
width: 50px;
height: 50px;
z-index: 0;
background-image: url(imgs/logo-sm.png);
background-size: contain;
background-repeat: no-repeat;
position: absolute;
top:0.5em;right:0.5em;
}
</style>
<script>(function () {
  let tries = 0
  function addLogo () {
    if (typeof slideshow === 'undefined') {
      tries += 1
      if (tries < 10) {
        setTimeout(addLogo, 100)
      }
    } else {
      document.querySelectorAll('.remark-slide-content:not(.title-slide):not(.inverse):not(.hide_logo)')
        .forEach(function (slide) {
          const logo = document.createElement('div')
          logo.classList = 'xaringan-extra-logo'
          logo.href = null
          slide.appendChild(logo)
        })
    }
  }
  document.addEventListener('DOMContentLoaded', addLogo)
})()</script>
</div>

.pull-left[

### 1. 字符串

* .font130[字符串基础]
  
  * .font130[模式匹配]
  
  * .font130[工具函数]

<br>

### 2. 因子

* .font130[因子向量]
  
  * .font130[修改因子水平]
  
  * .font130[调整因子次序]

]

.pull-right[

### 3. 日期-时间

* .font130[创建日期-时间]
  
  * .font130[处理时间成分]
  
  * .font130[时间间隔]

]

---
class: inverse, center, middle
background-image: url(imgs/logo-stringr.png), url(imgs/sxc.png)
background-size: 10%, 100%
background-position: 30% 40%, 0% 100%

# 1. 字符串

.font120[(.bold[`stringr`<sup>.font60[v1.4.1]</sup>]: Simple, Consistent Wrappers for Common String Operations)]

---
layout: true

### >> 1.1 字符串基础

---

```r
# 用 '' 或 "" 来创建字符串，但优先用 ""
x <- "This is a string."; x_zh <- '这是中文字符串。'
```

```r
# 若想在字符串中包括'或"，你需要使用转义符 \
x <- '"This is a string enclosed in \" & \"!"'    # ?'"'
x   # 打印字符串
```

```
#> [1] "\"This is a string enclosed in \" & \"!\""
```

```r
writeLines(x)   # 写出字符串的内容
```

```
#> "This is a string enclosed in " & "!"
```

```r
# 可直接输入给定Unicode编码表代码的字符
(zh <- c("\u4e00", "\u9fa5"))  # Unicode编码表中第一个和最后一个中文字符
```

```
#> [1] "一" "龥"
```

```r
# 用c()创建字符向量来存储多个字符串
(greeting <- c("Good morning", "同学们", "！"))
```

```
#> [1] "Good morning" "同学们"       "！"
```

---

.full-width[.content-box-blue.bold.font120[字符串长度]]

```r
c("Good morning", "同学们", "！") %>% str_length()  # {stringr}
# c("Good morning", "同学们", "！") %>% nchar()     # {base}
```

```
#> [1] 12  3  1
```

.full-width[.content-box-blue.bold.font120[字符串连接]]

```r
str_c("X", str_replace_na(c("a", NA, "c"), replacement = "NA"), 1:3, sep = "_")
# paste("X", c("a", NA, "c"), 1:3, sep = "_")
```

```
#> [1] "X_a_1"  "X_NA_2" "X_c_3"
```

```r
str_c("X", str_replace_na(c("a", NA, "c")), 1:3,  sep = "_", 
*     collapse = " |-> ")
paste0("X", c("a", NA, "c"), 1:3, sep = "_", collapse = " |-> ")
str_flatten(c("X", str_replace_na(c("a", NA, "c")), 1:3, sep = "_"), collapse = " |-> ")
```

```
#> [1] "X_a_1 |-> X_NA_2 |-> X_c_3"
#> [1] "Xa1_ |-> XNA2_ |-> Xc3_"
#> [1] "X |-> a |-> NA |-> c |-> 1 |-> 2 |-> 3 |-> _"
```

---

.full-width[.content-box-blue.bold.font120[提取（修改）子字符串]]

.code110[

```r
greeting
```

```
#> [1] "Good morning" "同学们"       "！"
```

```r
str_sub(greeting, start = 1:3, end = -2)    # 提取
```

```
#> [1] "Good mornin" "学"          ""
```

```r
str_sub(greeting, 1:3, -2) <- LETTERS[1:3]  # 修改
greeting
```

```
#> [1] "Ag"    "同B们" "！C"
```

```r
# base::substr(x, start, stop)
```
]

---
layout: true

### >> 1.2 模式匹配

---

.full-width[.content-box-blue.bold.font120[`stringr` 包多数函数都有 `pattern` 参数，默认调用正则表达式函数 `regex()`]]

> 正则表达式（_Regular Expression, RegEx_）是一种用来精简描述字符串模式的语言。
> `stringr` 包采用 `ICU` 正则表达式引擎（类似于 `Perl`）。

.pull-left[

```r
x <- c("apple & banana", "ana@财管2020")

# 基础匹配（精确匹配指定字符）
# -----------------------------
str_view(x, pattern = "na")
```

<div id="htmlwidget-e0c80754446792aea569" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-e0c80754446792aea569">{"x":{"html":"<ul>\n  <li>apple & ba<span class='match'>na<\/span>na<\/li>\n  <li>a<span class='match'>na<\/span>@财管2020<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>

```r
# "." 匹配任意一个字符（换行符除外）
# -----------------------------
str_view(x, pattern = "..a..")
```

<div id="htmlwidget-c07f0f8d61bef8227562" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-c07f0f8d61bef8227562">{"x":{"html":"<ul>\n  <li>apple &<span class='match'> bana<\/span>na<\/li>\n  <li><span class='match'>ana@财<\/span>管2020<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>
]

.pull-right[

```r
# 用转义符"\"转义
# -----------------------------
# 正则表达式"\\."匹配字符"."
str_view(c("abc", "a.c"), "a\\.c")
```

<div id="htmlwidget-3d31c4b31c541d55c861" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-3d31c4b31c541d55c861">{"x":{"html":"<ul>\n  <li>abc<\/li>\n  <li><span class='match'>a.c<\/span><\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>

```r
# 需要用"\\\\"来匹配一个字符"\"！
# str_view("a\\c", "a\\c")
str_view("a\\c", "a\\\\c")
```

<div id="htmlwidget-1958908fb874a8953d68" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-1958908fb874a8953d68">{"x":{"html":"<ul>\n  <li><span class='match'>a\\c<\/span><\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>
]

---

.pull-left.code100[

```r
# 锚定符"^"和"$"
# ----------------------------

str_view(x, "^a")  # ^：开头
```

<div id="htmlwidget-9f1c485b0e81c3ef7f2f" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-9f1c485b0e81c3ef7f2f">{"x":{"html":"<ul>\n  <li><span class='match'>a<\/span>pple & banana<\/li>\n  <li><span class='match'>a<\/span>na@财管2020<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>

```r
str_view(x, "a$")  # $：结尾
```

<div id="htmlwidget-405383a8d7980cadea57" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-405383a8d7980cadea57">{"x":{"html":"<ul>\n  <li>apple & banan<span class='match'>a<\/span><\/li>\n  <li>ana@财管2020<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>

```r
str_view(x, "^ana$")
```

<div id="htmlwidget-e5d8fb33a15003a4fcd6" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-e5d8fb33a15003a4fcd6">{"x":{"html":"<ul>\n  <li>apple & banana<\/li>\n  <li>ana@财管2020<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>
]

.pull-right.code95[

```r
# 匹配一组字符中的任意一个
# -----------------------------
# "\\d"匹配任意阿拉伯数字
# "\\s"匹配任意空白符
#     （如空格符、tab、换行符等）
# "[abc.]"匹配a或b或c或.（无需转义）
# "|"为或，"(a|b|c)"也匹配a或b或c
# "[a-c]"匹配从a到c的任意字符
*# "[^abc]"匹配非a非b和非c的任意字符

*str_view_all(x, "\\d")
```

<div id="htmlwidget-91e5a1cef3e21577e9c0" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-91e5a1cef3e21577e9c0">{"x":{"html":"<ul>\n  <li>apple & banana<\/li>\n  <li>ana@财管<span class='match'>2<\/span><span class='match'>0<\/span><span class='match'>2<\/span><span class='match'>0<\/span><\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>

```r
str_view_all(x, "[\u4e00-\u9fa5]")
```

<div id="htmlwidget-e9f3fe67bcecb62fab30" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-e9f3fe67bcecb62fab30">{"x":{"html":"<ul>\n  <li>apple & banana<\/li>\n  <li>ana@<span class='match'>财<\/span><span class='match'>管<\/span>2020<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>
]

---

.pull-left.code95[

```r
# 重复
# -----------------------------
# ?：0或1次
# +：至少1次
# *：任意次数
# {n}：正好n次
# {n,}：至少n次
# {,m}：至多m次
# {n,m}：n次和m次之间

# 从左到右依次查找直至匹配
str_view(x, "(a.)*")
```

<div id="htmlwidget-d6f2e2359b281333371d" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-d6f2e2359b281333371d">{"x":{"html":"<ul>\n  <li><span class='match'>ap<\/span>ple & banana<\/li>\n  <li><span class='match'>ana@<\/span>财管2020<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>

```r
str_view(x, "(a.){2,}")
```

<div id="htmlwidget-44f0b240b010a6f88eca" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-44f0b240b010a6f88eca">{"x":{"html":"<ul>\n  <li>apple & b<span class='match'>anan<\/span>a<\/li>\n  <li><span class='match'>ana@<\/span>财管2020<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>
]

.pull-right.code100[

```r
# 重复
# -----------------------------
# 默认为“贪婪”，最长匹配字符串
# 加上?修改为“懒惰”，最短匹配字符串

str_view(x, "(a.){1,}")
```

<div id="htmlwidget-3afbb8df2bcc95e00cc6" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-3afbb8df2bcc95e00cc6">{"x":{"html":"<ul>\n  <li><span class='match'>ap<\/span>ple & banana<\/li>\n  <li><span class='match'>ana@<\/span>财管2020<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>

```r
str_view(x, "(a.){1,}?")
```

<div id="htmlwidget-4633f14870f3e7ec3d10" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-4633f14870f3e7ec3d10">{"x":{"html":"<ul>\n  <li><span class='match'>ap<\/span>ple & banana<\/li>\n  <li><span class='match'>an<\/span>a@财管2020<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>
]

---

.pull-left.code100[

```r
# 分组与回溯引用(backreference)
# -----------------------------
# ()：字符串分组
#   （使用()也可使得RegEx更清晰）
# \1：回溯引用第1组匹配字符串

str_view(x, "(.).*\\1")
```

<div id="htmlwidget-81104ac738452b232fda" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-81104ac738452b232fda">{"x":{"html":"<ul>\n  <li><span class='match'>apple & banana<\/span><\/li>\n  <li><span class='match'>ana<\/span>@财管2020<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>

```r
str_view(x, "(.)(.).\\2\\1")
```

<div id="htmlwidget-44fb028bd2deafacb947" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-44fb028bd2deafacb947">{"x":{"html":"<ul>\n  <li>apple & b<span class='match'>anana<\/span><\/li>\n  <li>ana@财管2020<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>
]

.pull-right.code90[

```r
# regex()的参数
# -----------------------------
# ignore_case = FALSE
# multiline   = FALSE
# comments    = FALSE
# dotall      = FALSE

str_view("AB", 
  regex("ab", ignore_case = TRUE))
```

<div id="htmlwidget-f5993a389474a4cc2a87" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-f5993a389474a4cc2a87">{"x":{"html":"<ul>\n  <li><span class='match'>AB<\/span><\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>

```r
xd <- regex("
  \\(?        # 可能的(
  (0592)      # 四位区号0592，第1组
  [^\\d]*     # 可能的非数字，如)、-等
  (218\\d{4}) # 218开头的7个号码，第2组
  ", comments = TRUE)
str_match("0592-2187091", xd)
```

```
#>      [,1]           [,2]   [,3]     
#> [1,] "0592-2187091" "0592" "2187091"
```
]

---

.full-width[.content-box-blue.bold.font120[`pattern` 参数还支持 `fixed()`、`coll()` 和 `boundary()`]]

--
.pull-left.code95[

```r
# fixed()会忽略特殊的RegEx字符
# -----------------------------
str_view_all("\\d for 123...", 
             fixed("\\d"))
```

<div id="htmlwidget-4c86225a3d3001429612" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-4c86225a3d3001429612">{"x":{"html":"<ul>\n  <li><span class='match'>\\d<\/span> for 123...<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>

```r
# coll()按标准字符分类规则匹配字符
# -----------------------------
str_subset(
  c("I", "İ", "i", "ı"), 
  coll("i", ignore_case = TRUE, 
       locale = "tr")
)
```

```
#> [1] "İ" "i"
```
]

.pull-right.code95[

```r
# boundary(type = ...)匹配边界
# -----------------------------
# "character"：默认按字符
# "line_break"：按行
# "sentence"：按句
# "word"：按词

str_view_all(c("字符串与因子"), 
             boundary())
```

<div id="htmlwidget-c2a7a3979625b5903323" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-c2a7a3979625b5903323">{"x":{"html":"<ul>\n  <li><span class='match'>字<\/span><span class='match'>符<\/span><span class='match'>串<\/span><span class='match'>与<\/span><span class='match'>因<\/span><span class='match'>子<\/span><\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>

```r
str_view_all(c("字符串与因子"), 
             boundary("word"))
```

<div id="htmlwidget-3fc14bab63567589b8b0" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-3fc14bab63567589b8b0">{"x":{"html":"<ul>\n  <li><span class='match'>字符<\/span><span class='match'>串<\/span><span class='match'>与<\/span><span class='match'>因子<\/span><\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>
]

---
layout: true

### >> 1.3 工具函数

---

.full-width[.content-box-blue.bold.font120[匹配与否及匹配次数：`str_detect()` 和 `str_count()`]]

.pull-left.code90[

```r
# x向量中哪些字符串包含数字？
str_detect(x, "\\d")
```

```
#> [1] FALSE  TRUE
```

```r
# 内置向量words中以元音字母结尾词占比
mean(str_detect(words, "[aeiou]$"))
```

```
#> [1] 0.277
```

```r
# 提取words中以字母x结尾的词
words[str_detect(words, "x$")]
```

```
#> [1] "box" "sex" "six" "tax"
```

```r
# 等同于
str_subset(words, "x$")
```

```
#> [1] "box" "sex" "six" "tax"
```
]

.pull-right.code85[

```r
# 在数据框中搭配filter()使用
df <- tibble(word = words, 
             i = seq_along(word))
df %>% filter(str_detect(words, "x$"))
```

```
#> # A tibble: 4 × 2
#>   word      i
#>   <chr> <int>
#> 1 box     108
#> 2 sex     747
#> 3 six     772
#> 4 tax     841
```

```r
# 使用mutate()+str_count()统计匹配次数
df %>% mutate(
  vo = str_count(words, "[^aeiou]"))
```

```
#> # A tibble: 980 × 3
#>   word      i    vo
#>   <chr> <int> <int>
#> 1 a         1     0
#> 2 able      2     2
#> 3 about     3     2
#> # … with 977 more rows
```
]

---

.full-width[.content-box-blue.bold.font120[替换匹配：`str_replace()` 和 `str_replace_all()`]]

.pull-left.code95[

```r
# 将x向量中的元音字母替换为"*"
str_replace(x, "[aeiou]", "*")
```

```
#> [1] "*pple & banana" "*na@财管2020"
```

```r
# str_replace()只替换了第一个匹配
# str_replace_all()进行全部替换
# 如下可删除全部元音字母
str_replace_all(x, "[aeiou]", "")
```

```
#> [1] "ppl & bnn"  "n@财管2020"
```

```r
# 等同于
str_remove_all(x, "[aeiou]")
```

```
#> [1] "ppl & bnn"  "n@财管2020"
```
]

.pull-right.code95[

```r
# 通过指定命名向量完成多重替换
x2 <- c("1 car", "2 cats", 
        "3 cups")
str_replace_all(x2, 
  c("1" = "one", "\\d" = "some"))
```

```
#> [1] "one car"   "some cats" "some cups"
```

```r
# 回溯引用替换
str_replace(
  x, 
  "^(...)(.*)(...)$", 
  "\\3\\2\\1"
)
```

```
#> [1] "anale & banapp" "020@财管2ana"
```
]

---

.full-width[.content-box-blue.bold.font120[匹配分割：`str_split()` 和 `str_split_fixed()`]]

.code90[

```r
info <- c("姓名: 张三 & 李四", "年龄 @ 20 & 21", "宿舍 @ L2-111")
```
]

--
.pull-left.code95[

```r
# 在给定pattern处
# 将字符串向量info分割为列表
info %>% str_split(" ?[:@&] ")
# 默认参数n = Inf, simplify = FALSE
```

```
#> [[1]]
#> [1] "姓名" "张三" "李四"
#> 
#> [[2]]
#> [1] "年龄" "20"   "21"  
#> 
#> [[3]]
#> [1] "宿舍"   "L2-111"
```
]

.pull-right.code85[

```r
# 分割为矩阵
info %>% str_split(" ?[:@&] ", 
           n = 2, simplify = TRUE)
```

```
#>      [,1]   [,2]         
#> [1,] "姓名" "张三 & 李四"
#> [2,] "年龄" "20 & 21"    
#> [3,] "宿舍" "L2-111"
```

```r
# 分割为矩阵
info %>% 
  str_split_fixed(" ?[:@&] ", n = 4)
```

```
#>      [,1]   [,2]     [,3]   [,4]
#> [1,] "姓名" "张三"   "李四" ""  
#> [2,] "年龄" "20"     "21"   ""  
#> [3,] "宿舍" "L2-111" ""     ""
```
]

---

.full-width[.content-box-blue.bold.font120[提取匹配：`str_extract()` 和 `str_extract_all()`]]

.pull-left.code80[

```r
head(sentences, n = 5)  # 720 in total
```

```
#> [1] "The birch canoe slid on the smooth planks." 
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."     
#> [4] "These days a chicken leg is a rare dish."   
#> [5] "Rice is often served in round bowls."
```

```r
# 想从中找出含有2种及以上彩虹颜色单词的句子
rainbow <- c("red", "orange", "yellow", 
             "green", "blue", "indigo", 
             "purple")
rb_match <- str_c(rainbow, 
                  collapse = "|")
rb_match
```

.pull-right.code80[

```r
more <- sentences[
  str_count(sentences, rb_match) >= 2]
str_view_all(more, rb_match)
```

<div id="htmlwidget-0ac826c25a932da933a2" style="width:960px;height:100%;" class="str_view html-widget"></div>
<script type="application/json" data-for="htmlwidget-0ac826c25a932da933a2">{"x":{"html":"<ul>\n  <li>It is hard to erase <span class='match'>blue<\/span> or <span class='match'>red<\/span> ink.<\/li>\n  <li>The <span class='match'>green<\/span> light in the brown box flicke<span class='match'>red<\/span>.<\/li>\n  <li>The sky in the west is tinged with <span class='match'>orange<\/span> <span class='match'>red<\/span>.<\/li>\n<\/ul>"},"evals":[],"jsHooks":[]}</script>

```r
str_extract(more, rb_match)
```

```
#> [1] "blue"   "green"  "orange"
```

```r
str_extract_all(more, rb_match, 
                simplify = TRUE)
```

```
#>      [,1]     [,2] 
#> [1,] "blue"   "red"
#> [2,] "green"  "red"
#> [3,] "orange" "red"
```
]

---

.full-width[.content-box-blue.bold.font120[分组提取匹配：`str_match()` 和 `str_match_all()`]]

.code90[

```r
# 想从中sentences数据集中找出名词（使用启发式法则：在冠词a或the后面）
noun <- "(a|the) +([^ ]+)"
sentences %>% str_subset(noun) %>% head(2) -> has_noun
```
]

.pull-left.code85[

```r
has_noun %>% str_extract(noun)
```

```
#> [1] "the smooth" "the sheet"
```
]

.pull-right.code85[

```r
has_noun %>% str_match(noun)
```

```
#>      [,1]         [,2]  [,3]    
#> [1,] "the smooth" "the" "smooth"
#> [2,] "the sheet"  "the" "sheet"
```
]

.code90[

```r
# 在tibble中使用tidyr::extract()可能更方便
tibble(sentence = sentences) %>% 
  extract(sentence, c("article", "noun"), "(a|the) +([^ ]+)", remove = FALSE)
```

```
#> # A tibble: 720 × 3
#>   sentence                                    article noun  
#>   <chr>                                       <chr>   <chr> 
#> 1 The birch canoe slid on the smooth planks.  the     smooth
#> 2 Glue the sheet to the dark blue background. the     sheet 
#> 3 It's easy to tell the depth of a well.      the     depth 
#> # … with 717 more rows
```
]

---

.full-width[.content-box-blue.bold.font120[最后让我们看下 `stringr` 包中究竟有多少个以 `str_` 开头的函数]]

.code85[

```r
*ls("package:stringr", pattern = "^str_") %>%
  matrix(ncol = 5, byrow = TRUE) -> strfuns

strfuns[9, 3:5] <- "-"

strfuns %>% as_tibble(.name_repair = "unique") %>% print(n = 9)
```

```
#> # A tibble: 9 × 5
#>   ...1           ...2            ...3            ...4            ...5          
#>   <chr>          <chr>           <chr>           <chr>           <chr>         
#> 1 str_c          str_conv        str_count       str_detect      str_dup       
#> 2 str_ends       str_extract     str_extract_all str_flatten     str_glue      
#> 3 str_glue_data  str_interp      str_length      str_locate      str_locate_all
#> 4 str_match      str_match_all   str_order       str_pad         str_remove    
#> 5 str_remove_all str_replace     str_replace_all str_replace_na  str_sort      
#> 6 str_split      str_split_fixed str_squish      str_starts      str_sub       
#> 7 str_sub<-      str_subset      str_to_lower    str_to_sentence str_to_title  
#> 8 str_to_upper   str_trim        str_trunc       str_view        str_view_all  
#> 9 str_which      str_wrap        -               -               -
```
]

---
layout: false
class: inverse, center, middle
background-image: url(imgs/logo-forcats.png), url(imgs/sxc.png)
background-size: 10%, 100%
background-position: 30% 40%, 0% 100%

# 2. 因子

.font120[(.bold[`forcats`<sup>.font60[v0.5.2]</sup>]: Tools for Working with Categorical Variables (Factors))]

---
layout: true

### >> 2.1 因子向量

---
.full-width[.content-box-blue.bold.font120[我们可以将字符串直接存入字符向量中 ...]]

```r
x1 <- c("Excelence", "Good", "Average", "Pass", "Failed")
```

.full-width[.content-box-blue.bold.font120[... 但以字符方式存储数据可能面临两个问题]]

.font120[
* 取值水平问题：尽管我们知道成绩等级水平只有5个，但字符方式无法保证你没有输入错误
]

```r
x2 <- c("Excelence", "Good", "Averege", "Pass", "Failed") # "Averege" 输入错了
```

.font120[
* 排序问题：可能无法得到有意义的排序结果 <sup>.red[*]</sup>
]

```r
sort(x1)
```

```
#> [1] "Average"   "Excelence" "Failed"    "Good"      "Pass"
```

.footnote.red.font100[*：而我们在制表、作图等场合通常需要保持合理的排序信息。]

---

.pull-left[

.full-width[.content-box-blue.bold.font120[创建因子向量 ...]]

```r
# 1. 设定合法的因子水平及其次序
grade_levels <- c(
  "Failed", "Pass", "Average",
  "Good", "Excelence"
)

# 2. 将字符向量转变为因子向量
y1 <- factor(x1, 
             levels = grade_levels)
y1
```

```
#> [1] Excelence Good      Average   Pass      Failed   
#> Levels: Failed Pass Average Good Excelence
```
]

.pull-right[

.full-width[.content-box-blue.bold.font120[... 有助于克服这两个问题]]

```r
#1. 不合法的字符取值将变为`NA`
y2 <- factor(x2, 
             levels = grade_levels)
y2
```

```
#> [1] Excelence Good      <NA>      Pass      Failed   
#> Levels: Failed Pass Average Good Excelence
```

```r
#2. 排序正确
sort(y1)
```

```
#> [1] Failed    Pass      Average   Good      Excelence
#> Levels: Failed Pass Average Good Excelence
```
]

---

.full-width[.content-box-blue.bold.font120[其它说明]]

* R 以整数存储因子向量 <br>

.pull-left[

* 若使用 `readr::parse_factor()`，出现不合法的输入时会给你提醒

```r
  y2 <- parse_factor(x2, 
              levels = grade_levels)
  ```
  
  ```
  #> Warning: 1 parsing failure.
  #> row col           expected  actual
  #>   3  -- value in level set Averege
  ```
]

.pull-right[

* 若不指定因子水平的 `levels` 参数，则以字符排序作为因子顺序

```r
  factor(x1)
  as.integer(factor(x1))
  # unclass(factor(x1))
  ```
  
  ```
  #> [1] Excelence Good      Average   Pass      Failed   
  #> Levels: Average Excelence Failed Good Pass
  #> [1] 2 4 1 5 3
  ```
]

---
layout: true

### >> 2.2 修改因子水平

---

.pull-left.code95[

```r
# 示例数据集
gss_cat
```

```
#> # A tibble: 21,483 × 9
#>     year marital         age race 
#>    <int> <fct>         <int> <fct>
#>  1  2000 Never married    26 White
#>  2  2000 Divorced         48 White
#>  3  2000 Widowed          67 White
#>  4  2000 Never married    39 White
#>  5  2000 Divorced         25 White
#>  6  2000 Married          25 White
#>  7  2000 Never married    36 White
#>  8  2000 Divorced         44 White
#>  9  2000 Married          44 White
#> 10  2000 Married          47 White
#> # … with 21,473 more rows, and 5
#> #   more variables: rincome <fct>,
#> #   partyid <fct>, relig <fct>,
#> #   denom <fct>, tvhours <int>
```
]

.pull-right.code100[

```r
# 查看特定变量的因子水平及其计数
gss_cat %>% 
  count(partyid)
```

```
#> # A tibble: 10 × 2
#>    partyid                n
#>    <fct>              <int>
#>  1 No answer            154
#>  2 Don't know             1
#>  3 Other party          393
#>  4 Strong republican   2314
#>  5 Not str republican  3032
#>  6 Ind,near rep        1791
#>  7 Independent         4119
#>  8 Ind,near dem        2499
#>  9 Not str democrat    3690
#> 10 Strong democrat     3490
```
]

---
.full-width[.content-box-blue.bold.font120[`fct_recode()`：手动改变因子水平]]
--

.code95[

```r
gss_cat %>% mutate(
  partyid = fct_recode(
    partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat",
*   "Other"                 = "No answer",
*   "Other"                 = "Don't know",
*   "Other"                 = "Other party"
    # fct_recode()中未指定的因子水平 Independent 将保持原样
  )
) %>% count(partyid)
```

```
#> # A tibble: 8 × 2
#>   partyid                n
#>   <fct>              <int>
#> 1 Other                548
#> 2 Republican, strong  2314
#> 3 Republican, weak    3032
#> # … with 5 more rows
```
]

---

.full-width[.content-box-blue.bold.font120[`fct_collapse()`：合并因子水平]]
--

.code90[

```r
# fct_collapse()函数是fct_recode()函数的一种变体
gss_cat %>% mutate(
  partyid = fct_collapse(
    partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")
    # 左边满足R合法变量名的因子水平无需使用""（但要用""也行）
    # 因子水平的顺序以在`fct_collapse()`中出现的先后为序
  )
) %>% count(partyid)
```

```
#> # A tibble: 4 × 2
#>   partyid     n
#>   <fct>   <int>
#> 1 other     548
#> 2 rep      5346
#> 3 ind      8409
#> 4 dem      7180
```
]

---

.full-width[.content-box-blue.bold.font120[`fct_lump*()`：将最不常见/最常见的因子水平归并为 "Other"]]
--

.pull-left.code90[

```r
gss_cat %>%
  count(relig, sort = TRUE)
```

```
#> # A tibble: 15 × 2
#>    relig                       n
#>    <fct>                   <int>
#>  1 Protestant              10846
#>  2 Catholic                 5124
#>  3 None                     3523
#>  4 Christian                 689
#>  5 Jewish                    388
#>  6 Other                     224
#>  7 Buddhism                  147
#>  8 Inter-nondenominational   109
#>  9 Moslem/islam              104
#> 10 Orthodox-christian         95
#> 11 No answer                  93
#> 12 Hinduism                   71
#> 13 Other eastern              32
#> 14 Native american            23
#> 15 Don't know                 15
```
]

.pull-right.code90[

```r
gss_cat %>%
  mutate(
    relig = fct_lump_n(
      relig, 
*     n = 5,  # 负数归并最常见的水平
*     other_level = "Other"  # 默认
    )
  ) %>% 
  count(relig, sort = TRUE)
```

```
#> # A tibble: 6 × 2
#>   relig          n
#>   <fct>      <int>
#> 1 Protestant 10846
#> 2 Catholic    5124
#> 3 None        3523
*#> 4 Other        913
#> 5 Christian    689
#> 6 Jewish       388
```
]

---
layout: true

### >> 2.3 调整因子次序

---
.full-width[.content-box-blue.bold.font120[有时默认的因子次序并不满足你的要求 ...]]

.pull-left.code95[

```r
# 探索分析信仰不同宗教的人在
# 每日看电视时间方面的差异
relig_summary <- gss_cat %>%
* group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
*   tvhours = mean(tvhours,
*                  na.rm = TRUE),
    n = n()
  )

# 作图
ggplot(relig_summary, 
       aes(tvhours, relig)) + 
  geom_point()
```
]

.pull-right[
<img src="L08_String_Factor_Datetime_files/figure-html/unnamed-chunk-59-1.png" width="100%" style="display: block; margin: auto;" />

- 很难从上图中看出特定模式并进行相应的分析，😢
]

---

.full-width[.content-box-blue.bold.font100[`fct_reorder()`：基于另一变量取值的排序来调整因子次序]]
--

.code95[

```r
# fct_reorder(.f, .x, .fun = median, ..., .desc = FALSE)
relig_summary %>% 
  ggplot(aes(tvhours, fct_reorder(relig, tvhours))) + 
  geom_point()
```
]

.pull-left[
<img src="L08_String_Factor_Datetime_files/figure-html/unnamed-chunk-60-1.png" width="100%" />
]

.pull-right.code95[

```r
# 左图的纵坐标标题有点奇怪 ->
# 作图前先生成新变量可能是个更好的做法
relig_summary %>% 
* mutate(
*   relig = fct_reorder(relig,
*                       tvhours)
* ) %>%
  ggplot(aes(tvhours, relig)) + 
  geom_point()
```
]

---

.full-width[.content-box-blue.bold[`fct_relevel()`：手动调整因子次序]]
--

.code90[

```r
# fct_relevel(.f, ..., after = 0L)
relig_summary %>% 
  mutate(relig = fct_relevel(relig, "No answer", after = Inf)) %>% 
  arrange(relig)
```
]

.full-width[.content-box-blue.bold[`fct_infreq() | fct_inorder() | fct_inseq()`：分别根据频数从大到小 | 在数据集中出现先后 | 数字顺序来调整因子次序]]

.full-width[.content-box-blue.bold[`fct_rev()`：反转因子次序]]
--

.pull-left.code90[

```r
gss_cat %>%
* mutate(marital =
*          marital %>%
*          fct_infreq() %>%
*          fct_rev()) %>%
  ggplot(aes(marital)) + geom_bar()
```
]

.pull-right[
<img src="L08_String_Factor_Datetime_files/figure-html/unnamed-chunk-63-1.png" width="90%" style="display: block; margin: auto;" />
]

---
layout: false
class: inverse, center, middle
background-image: url(imgs/logo-lubridate.png), url(imgs/sxc.png)
background-size: 10%, 100%
background-position: 25% 40%, 0% 100%

# 3. 日期-时间

.font120[(.bold[`lubridate`<sup>.font60[v1.9.0]</sup>]: Make Dealing with Dates a Little Easier)]

---
layout: true

### >> 3.1 创建日期-时间

---

.full-width[.content-box-blue.bold.font120[`readr::parse_datetime() | parse_date() | parse_time()`]]

.code85[

```r
parse_datetime(x, format = "", na = c("", "NA"), locale = default_locale())
# format: If unset, parsed as ISO8601 with formats specified in the locale().
```
]

.pull-left.code85[

```r
- Year: "%Y"(4 digits); "%y"(2 digits)
- Month:
  "%m"(2 digits) 
  "%b"(abbr. name in locale()) 
  "%B"(full name in locale())
- Day: 
  "%d"(2 digits)
  "%e"(optional leading space)
- Hour:
  "%H"(0-23)
  "%I"(0-12, used with "%p")
- AM/PM indicator: "%p"
- Minutes: "%M"
- Seconds: 
  "%S"(integer seconds)
  "%OS"(real seconds)
```
]

.pull-right.code85[

```r
- Time zone:
  "%Z"(name, e.g. "America/Chicago")
  "%z"(offset from UTC, e.g. "+0800")
- Non-digits:
  "%." skips one non-digit
  "%+" skips one or more non-digits
  "%*" skips any number non-digits
```

```r
parse_datetime("2022-11-10 12:00")
```

```
#> [1] "2022-11-10 12:00:00 UTC"
```

```r
parse_date("11/10/y21", "%m/%d/%.%y")
```

```
#> [1] "2021-11-10"
```
]

---

.full-width[.content-box-blue.bold.font120[`lubridate::ymd_hms() | ymd() ...`]]

.pull-left.code95[

```r
*library(lubridate) # 手动加载
```

```r
ymd("2022-11-10")
```

```
#> [1] "2022-11-10"
```

```r
# All are Date 2022-11-10
dmy("10-11, 2022")

ymd(20221110)  # unquoted numbers

# Sys.getlocale("LC_TIME")
ymd("2022十一月10日") # “日”会被忽略
```
]

.pull-right.code95[

```r
# truncated formats
ymd("2022", truncated = 2)
```

```
#> [1] "2022-01-01"
```

```r
ymd_hms("2022-11-10 20:11:59")
```

```
#> [1] "2022-11-10 20:11:59 UTC"
```

```r
mdy_hm("11-10 2022 !08:11 PM")
```

```
#> [1] "2022-11-10 20:11:00 UTC"
```

```r
# <date> with tz -> <dttm>
ymd(20221110, tz = "UTC")
```

```
#> [1] "2022-11-10 UTC"
```
]

---

.full-width[.content-box-blue.bold.font120[`lubridate::make_datetime() | make_date()`]]

.code90[

```r
(flights_dt <- nycflights13::flights %>% 
  filter(!is.na(dep_time), !is.na(arr_time)) %>% 
  mutate(
    dep_dttm = make_datetime(
      year, month, day,               # year, month, day
      dep_time%/%100, dep_time%%100   # hour, min,   sec = 0
    ),
    arr_dttm = make_datetime(
      year, month, day,               # year, month, day
      arr_time%/%100, arr_time%%100   # hour, min,   sec = 0
    )
  ) %>% 
  select(origin, dest, dep_time, arr_time, ends_with("_dttm")))
```

```
#> # A tibble: 328,063 × 6
#>   origin dest  dep_time arr_time dep_dttm            arr_dttm           
#>   <chr>  <chr>    <int>    <int> <dttm>              <dttm>             
#> 1 EWR    IAH        517      830 2013-01-01 05:17:00 2013-01-01 08:30:00
#> 2 LGA    IAH        533      850 2013-01-01 05:33:00 2013-01-01 08:50:00
#> 3 JFK    MIA        542      923 2013-01-01 05:42:00 2013-01-01 09:23:00
#> # … with 328,060 more rows
```
]

---

.full-width[.content-box-blue.bold.font120[`lubridate::make_datetime() | make_date()`]]

.code95[

```r
flights_dt %>% 
* filter(dep_dttm < ymd(20130102)) %>%    # <dttm> 可和 <date> 直接比较大小
  ggplot(aes(dep_dttm)) + 
  geom_freqpoly(binwidth = 600) +         # 600s = 10 minutes
  scale_x_datetime(date_labels = "%H:%M") # 设定坐标轴时间刻度标签的显示格式
```

<img src="L08_String_Factor_Datetime_files/figure-html/unnamed-chunk-75-1.png" width="55%" style="display: block; margin: auto;" />
]

---

.full-width[.content-box-blue.bold.font120[`lubridate::as_datetime() | as_date()`]]

.pull-left.code100[

```r
# S4 method for 
# signature 'POSIXt' and 'Date'
as_datetime(today())
```

```
#> [1] "2022-11-17 UTC"
```

```r
as_date(now())
```

```
#> [1] "2022-11-17"
```
]

.pull-right.code90[

```r
# S4 method for 
# signature 'numeric'
# 默认参数origin = lubridate::origin
#         tz = "UTC"
as_datetime(60 * 60 * 10)
```

```
#> [1] "1970-01-01 10:00:00 UTC"
```

```r
# 默认参数origin = lubridate::origin
as_date(365 * 10 + 2)      
```

```
#> [1] "1980-01-01"
```

```r
# S4 method for 
# signature 'character'
as_datetime("2019/10/14 10:00", 
    format = "%Y/%m/%d %H:%M")
```

```
#> [1] "2019-10-14 10:00:00 UTC"
```
]

---
layout: true

### >> 3.2 处理时间成分

---

.full-width[.content-box-blue.bold.font120[`lubridate::year() | month() ...`：提取|更改时间成分]]

.pull-left.code80[

```r
datetime <- ymd_hm("2022/11/10 10:00")

# 提取时间成分
# -------------------------------
month(datetime, label = TRUE)
```

```
#> [1] Nov
#> Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < Nov < Dec
```

```r
mday(datetime)
```

```
#> [1] 10
```

```r
yday(datetime)
```

```
#> [1] 314
```

```r
wday(datetime, label = TRUE)
```

```
#> [1] Thu
#> Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
```
]

.pull-right.code85[

```r
# 更改时间成分
# -------------------------------
# 就地修改datetime
year(datetime) <- 2020
datetime
```

```
#> [1] "2020-11-10 10:00:00 UTC"
```

```r
wday(datetime) <- wday(datetime) + 1
datetime
```

```
#> [1] "2020-11-11 10:00:00 UTC"
```

```r
# 同时更改多个成分并生成新的日期-时间
# 原来的datetime保持不变
update(datetime, year = 2020, 
       month = 2, hour = 2)
```

```
#> [1] "2020-02-11 02:00:00 UTC"
```
]

---

.full-width[.content-box-blue.bold.font120[`lubridate::year() | month() ...`：提取|更改时间成分]]

```r
flights_dt %>% 
* mutate(wday = wday(dep_dttm, label = TRUE)) %>%
  ggplot(aes(x = wday)) +
    geom_bar()
```

---

.full-width[.content-box-blue.bold.font120[`lubridate::floor_date() ...`：时间成分取整]]

```r
# floor_date(x, unit = "second", week_start = getOption("lubridate.week.start", 7))
# unit: second, minute, hour, day, week, month, year, 
#       bimonth, quarter, season, halfyear; 3 months, etc.
flights_dt %>% 
* count(week = floor_date(dep_dttm, unit = "week")) %>%
  ggplot(aes(week, n)) + geom_line() + 
  scale_x_datetime(date_labels = "%YW%U")  # ?strptime
```

---
layout: true

### >> 3.3 时间间隔（time spans）

---

`lubridate`包提供三种记录/处理时间间隔的方式：
*   _`duration`_：用秒精确记录/处理，`dseconds() | dhours() ...`
*   _`period`_：用人们熟悉的方式记录/处理，`seconds() | hours() ...`
*   _`interval`_：记录起点-终点之间的时间间隔，`%--%`

.pull-left.code80[

```r
# Durations
# ---------------------------------------
dyears(1)
```

```
#> [1] "31557600s (~1 years)"
```

```r
# Periods
# ---------------------------------------
years(1)
```

```
#> [1] "1y 0m 0d 0H 0M 0S"
```

```r
# Intervals
# ---------------------------------------
today() %--% (today() + years(1))
```

```
#> [1] 2022-11-17 UTC--2023-11-17 UTC
```
]

.pull-right.code80[

```r
# A leap year
ymd("2016-01-01") + dyears(1)  # ?!
```

```
#> [1] "2016-12-31 06:00:00 UTC"
```

```r
ymd("2016-01-01") + years(1)
```

```
#> [1] "2017-01-01"
```

```r
# Arithmetic operations
y1 <- ymd("2016-01-01")
y2 <- y1 + years(1)
(y1 %--% y2) / ddays(1)
```

```
#> [1] 366
```

```r
(y1 %--% y2) %/% days(1)
```

```
#> [1] 366
```
]

---

.full-width[.content-box-blue.bold.font120[不同数据类型之间允许的运算]]

.code90[

```r
# 示例
flights_dt <- flights_dt %>% 
  mutate(
    overnight = arr_dttm < dep_dttm,
    arr_dttm = arr_dttm + days(overnight * 1),
  )
# Now all flights obey the laws of physics
flights_dt %>% filter(overnight, arr_dttm < dep_dttm) 
```

```
#> # A tibble: 0 × 7
#> # … with 7 variables: origin <chr>, dest <chr>, dep_time <int>, arr_time <int>,
#> #   dep_dttm <dttm>, arr_dttm <dttm>, overnight <lgl>
```
]

---
layout: false
class: inverse, center, middle

# 课后作业

---
class: middle

### 课后作业

.Large[

1. 根据本讲讲义的打印稿，在 📑 _Rmd_ 中键入并完成代码的运行

1. **复习** 📖 [_R for Data Science_](https://r4ds.had.co.nz/) 一书关于 [字符串](https://r4ds.had.co.nz/strings.html)、[因子](https://r4ds.had.co.nz/factors.html) 和 [日期-时间](https://r4ds.had.co.nz/dates-and-times.html) 等三种重要数据类型以及相应处理函数的内容（为中文翻译版的第10章、第11章和第12章）

1. 下载（打印）📰 [.bold[{{`stringr`的cheatsheet}}]](https://posit.co/wp-content/uploads/2022/10/strings.pdf)、[.bold[{{`forcats`的cheatsheet}}]](https://posit.co/wp-content/uploads/2022/10/factors.pdf) 和 [.bold[{{`lubridate`的cheatsheet}}]](https://posit.co/wp-content/uploads/2022/10/lubridate.pdf) 并阅读之

1. 完成第八讲的课后练习 👩‍💻
  
> .code100[

```r
remotes::install_github("qfwr2022/qfwr")
library(qfwr)
qfwr_ex("L08")
```
]
]

---
class: center, middle, hide_logo
background-image: url(imgs/xaringan.png)
background-size: 12%
background-position: 50% 40%

<br><br><br><br><br><br><br>
<hr color='#f00' size='2px' width='80%'>
<br>
.Large.red[_**本网页版讲义的制作由 R 包 [{{`xaringan`}}](https://github.com/yihui/xaringan) 赋能！**_]