class: center, middle, inverse, title-slide .title[ # 量化金融与金融编程 ] .subtitle[ ##
L08 字符串、因子和日期-时间 ] .author[ ###
曾永艺 ] .institute[ ### 厦门大学管理学院 ] .date[ ###
2022-11-18 ] --- class: middle background-image: url(imgs/s-f-l.png) background-size: 15em background-position: 90% 85%
-- .pull-left[ ### 1. 字符串 * .font130[字符串基础] * .font130[模式匹配] * .font130[工具函数] <br> ### 2. 因子 * .font130[因子向量] * .font130[修改因子水平] * .font130[调整因子次序] ] .pull-right[ ### 3. 日期-时间 * .font130[创建日期-时间] * .font130[处理时间成分] * .font130[时间间隔] ] --- class: inverse, center, middle background-image: url(imgs/logo-stringr.png), url(imgs/sxc.png) background-size: 10%, 100% background-position: 30% 40%, 0% 100% # 1. 字符串 .font120[(.bold[`stringr`<sup>.font60[v1.4.1]</sup>]: Simple, Consistent Wrappers for Common String Operations)] --- layout: true ### >> 1.1 字符串基础 --- ```r # 用 '' 或 "" 来创建字符串,但优先用 "" x <- "This is a string."; x_zh <- '这是中文字符串。' ``` -- ```r # 若想在字符串中包括'或",你需要使用转义符 \ x <- '"This is a string enclosed in \" & \"!"' # ?'"' x # 打印字符串 ``` ``` #> [1] "\"This is a string enclosed in \" & \"!\"" ``` ```r writeLines(x) # 写出字符串的内容 ``` ``` #> "This is a string enclosed in " & "!" ``` -- ```r # 可直接输入给定Unicode编码表代码的字符 (zh <- c("\u4e00", "\u9fa5")) # Unicode编码表中第一个和最后一个中文字符 ``` ``` #> [1] "一" "龥" ``` -- ```r # 用c()创建字符向量来存储多个字符串 (greeting <- c("Good morning", "同学们", "!")) ``` ``` #> [1] "Good morning" "同学们" "!" ``` --- .full-width[.content-box-blue.bold.font120[字符串长度]] ```r c("Good morning", "同学们", "!") %>% str_length() # {stringr} # c("Good morning", "同学们", "!") %>% nchar() # {base} ``` ``` #> [1] 12 3 1 ``` -- .full-width[.content-box-blue.bold.font120[字符串连接]] ```r str_c("X", str_replace_na(c("a", NA, "c"), replacement = "NA"), 1:3, sep = "_") # paste("X", c("a", NA, "c"), 1:3, sep = "_") ``` ``` #> [1] "X_a_1" "X_NA_2" "X_c_3" ``` ```r str_c("X", str_replace_na(c("a", NA, "c")), 1:3, sep = "_", * collapse = " |-> ") paste0("X", c("a", NA, "c"), 1:3, sep = "_", collapse = " |-> ") str_flatten(c("X", str_replace_na(c("a", NA, "c")), 1:3, sep = "_"), collapse = " |-> ") ``` ``` #> [1] "X_a_1 |-> X_NA_2 |-> X_c_3" #> [1] "Xa1_ |-> XNA2_ |-> Xc3_" #> [1] "X |-> a |-> NA |-> c |-> 1 |-> 2 |-> 3 |-> _" ``` --- .full-width[.content-box-blue.bold.font120[提取(修改)子字符串]] .code110[ ```r greeting ``` ``` #> [1] "Good morning" "同学们" "!" ``` ```r str_sub(greeting, start = 1:3, end = -2) # 提取 ``` ``` #> [1] "Good mornin" "学" "" ``` ```r str_sub(greeting, 1:3, -2) <- LETTERS[1:3] # 修改 greeting ``` ``` #> [1] "Ag" "同B们" "!C" ``` ```r # base::substr(x, start, stop) ``` ] --- layout: true ### >> 1.2 模式匹配 --- .full-width[.content-box-blue.bold.font120[`stringr` 包多数函数都有 `pattern` 参数,默认调用正则表达式函数 `regex()`]] > 正则表达式(_Regular Expression, RegEx_)是一种用来精简描述字符串模式的语言。 > `stringr` 包采用 `ICU` 正则表达式引擎(类似于 `Perl`)。 -- .pull-left[ ```r x <- c("apple & banana", "ana@财管2020") # 基础匹配(精确匹配指定字符) # ----------------------------- str_view(x, pattern = "na") ```
```r # "." 匹配任意一个字符(换行符除外) # ----------------------------- str_view(x, pattern = "..a..") ```
] -- .pull-right[ ```r # 用转义符"\"转义 # ----------------------------- # 正则表达式"\\."匹配字符"." str_view(c("abc", "a.c"), "a\\.c") ```
```r # 需要用"\\\\"来匹配一个字符"\"! # str_view("a\\c", "a\\c") str_view("a\\c", "a\\\\c") ```
] --- .pull-left.code100[ ```r # 锚定符"^"和"$" # ---------------------------- str_view(x, "^a") # ^:开头 ```
```r str_view(x, "a$") # $:结尾 ```
```r str_view(x, "^ana$") ```
] -- .pull-right.code95[ ```r # 匹配一组字符中的任意一个 # ----------------------------- # "\\d"匹配任意阿拉伯数字 # "\\s"匹配任意空白符 # (如空格符、tab、换行符等) # "[abc.]"匹配a或b或c或.(无需转义) # "|"为或,"(a|b|c)"也匹配a或b或c # "[a-c]"匹配从a到c的任意字符 *# "[^abc]"匹配非a非b和非c的任意字符 *str_view_all(x, "\\d") ```
```r str_view_all(x, "[\u4e00-\u9fa5]") ```
] --- .pull-left.code95[ ```r # 重复 # ----------------------------- # ?:0或1次 # +:至少1次 # *:任意次数 # {n}:正好n次 # {n,}:至少n次 # {,m}:至多m次 # {n,m}:n次和m次之间 # 从左到右依次查找直至匹配 str_view(x, "(a.)*") ```
```r str_view(x, "(a.){2,}") ```
] -- .pull-right.code100[ ```r # 重复 # ----------------------------- # 默认为“贪婪”,最长匹配字符串 # 加上?修改为“懒惰”,最短匹配字符串 str_view(x, "(a.){1,}") ```
```r str_view(x, "(a.){1,}?") ```
] --- .pull-left.code100[ ```r # 分组与回溯引用(backreference) # ----------------------------- # ():字符串分组 # (使用()也可使得RegEx更清晰) # \1:回溯引用第1组匹配字符串 str_view(x, "(.).*\\1") ```
```r str_view(x, "(.)(.).\\2\\1") ```
] -- .pull-right.code90[ ```r # regex()的参数 # ----------------------------- # ignore_case = FALSE # multiline = FALSE # comments = FALSE # dotall = FALSE str_view("AB", regex("ab", ignore_case = TRUE)) ```
```r xd <- regex(" \\(? # 可能的( (0592) # 四位区号0592,第1组 [^\\d]* # 可能的非数字,如)、-等 (218\\d{4}) # 218开头的7个号码,第2组 ", comments = TRUE) str_match("0592-2187091", xd) ``` ``` #> [,1] [,2] [,3] #> [1,] "0592-2187091" "0592" "2187091" ``` ] --- .full-width[.content-box-blue.bold.font120[`pattern` 参数还支持 `fixed()`、`coll()` 和 `boundary()`]] -- .pull-left.code95[ ```r # fixed()会忽略特殊的RegEx字符 # ----------------------------- str_view_all("\\d for 123...", fixed("\\d")) ```
```r # coll()按标准字符分类规则匹配字符 # ----------------------------- str_subset( c("I", "İ", "i", "ı"), coll("i", ignore_case = TRUE, locale = "tr") ) ``` ``` #> [1] "İ" "i" ``` ] -- .pull-right.code95[ ```r # boundary(type = ...)匹配边界 # ----------------------------- # "character":默认按字符 # "line_break":按行 # "sentence":按句 # "word":按词 str_view_all(c("字符串与因子"), boundary()) ```
```r str_view_all(c("字符串与因子"), boundary("word")) ```
] --- layout: true ### >> 1.3 工具函数 --- .full-width[.content-box-blue.bold.font120[匹配与否及匹配次数:`str_detect()` 和 `str_count()`]] -- .pull-left.code90[ ```r # x向量中哪些字符串包含数字? str_detect(x, "\\d") ``` ``` #> [1] FALSE TRUE ``` ```r # 内置向量words中以元音字母结尾词占比 mean(str_detect(words, "[aeiou]$")) ``` ``` #> [1] 0.277 ``` ```r # 提取words中以字母x结尾的词 words[str_detect(words, "x$")] ``` ``` #> [1] "box" "sex" "six" "tax" ``` ```r # 等同于 str_subset(words, "x$") ``` ``` #> [1] "box" "sex" "six" "tax" ``` ] -- .pull-right.code85[ ```r # 在数据框中搭配filter()使用 df <- tibble(word = words, i = seq_along(word)) df %>% filter(str_detect(words, "x$")) ``` ``` #> # A tibble: 4 × 2 #> word i #> <chr> <int> #> 1 box 108 #> 2 sex 747 #> 3 six 772 #> 4 tax 841 ``` ```r # 使用mutate()+str_count()统计匹配次数 df %>% mutate( vo = str_count(words, "[^aeiou]")) ``` ``` #> # A tibble: 980 × 3 #> word i vo #> <chr> <int> <int> #> 1 a 1 0 #> 2 able 2 2 #> 3 about 3 2 #> # … with 977 more rows ``` ] --- .full-width[.content-box-blue.bold.font120[替换匹配:`str_replace()` 和 `str_replace_all()`]] -- .pull-left.code95[ ```r # 将x向量中的元音字母替换为"*" str_replace(x, "[aeiou]", "*") ``` ``` #> [1] "*pple & banana" "*na@财管2020" ``` ```r # str_replace()只替换了第一个匹配 # str_replace_all()进行全部替换 # 如下可删除全部元音字母 str_replace_all(x, "[aeiou]", "") ``` ``` #> [1] "ppl & bnn" "n@财管2020" ``` ```r # 等同于 str_remove_all(x, "[aeiou]") ``` ``` #> [1] "ppl & bnn" "n@财管2020" ``` ] -- .pull-right.code95[ ```r # 通过指定命名向量完成多重替换 x2 <- c("1 car", "2 cats", "3 cups") str_replace_all(x2, c("1" = "one", "\\d" = "some")) ``` ``` #> [1] "one car" "some cats" "some cups" ``` ```r # 回溯引用替换 str_replace( x, "^(...)(.*)(...)$", "\\3\\2\\1" ) ``` ``` #> [1] "anale & banapp" "020@财管2ana" ``` ] --- .full-width[.content-box-blue.bold.font120[匹配分割:`str_split()` 和 `str_split_fixed()`]] -- .code90[ ```r info <- c("姓名: 张三 & 李四", "年龄 @ 20 & 21", "宿舍 @ L2-111") ``` ] -- .pull-left.code95[ ```r # 在给定pattern处 # 将字符串向量info分割为列表 info %>% str_split(" ?[:@&] ") # 默认参数n = Inf, simplify = FALSE ``` ``` #> [[1]] #> [1] "姓名" "张三" "李四" #> #> [[2]] #> [1] "年龄" "20" "21" #> #> [[3]] #> [1] "宿舍" "L2-111" ``` ] -- .pull-right.code85[ ```r # 分割为矩阵 info %>% str_split(" ?[:@&] ", n = 2, simplify = TRUE) ``` ``` #> [,1] [,2] #> [1,] "姓名" "张三 & 李四" #> [2,] "年龄" "20 & 21" #> [3,] "宿舍" "L2-111" ``` ```r # 分割为矩阵 info %>% str_split_fixed(" ?[:@&] ", n = 4) ``` ``` #> [,1] [,2] [,3] [,4] #> [1,] "姓名" "张三" "李四" "" #> [2,] "年龄" "20" "21" "" #> [3,] "宿舍" "L2-111" "" "" ``` ] --- .full-width[.content-box-blue.bold.font120[提取匹配:`str_extract()` 和 `str_extract_all()`]] -- .pull-left.code80[ ```r head(sentences, n = 5) # 720 in total ``` ``` #> [1] "The birch canoe slid on the smooth planks." #> [2] "Glue the sheet to the dark blue background." #> [3] "It's easy to tell the depth of a well." #> [4] "These days a chicken leg is a rare dish." #> [5] "Rice is often served in round bowls." ``` ```r # 想从中找出含有2种及以上彩虹颜色单词的句子 rainbow <- c("red", "orange", "yellow", "green", "blue", "indigo", "purple") rb_match <- str_c(rainbow, collapse = "|") rb_match ``` ``` #> [1] "red|orange|yellow|green|blue|indigo|purple" ``` ] -- .pull-right.code80[ ```r more <- sentences[ str_count(sentences, rb_match) >= 2] str_view_all(more, rb_match) ```
```r str_extract(more, rb_match) ``` ``` #> [1] "blue" "green" "orange" ``` ```r str_extract_all(more, rb_match, simplify = TRUE) ``` ``` #> [,1] [,2] #> [1,] "blue" "red" #> [2,] "green" "red" #> [3,] "orange" "red" ``` ] --- .full-width[.content-box-blue.bold.font120[分组提取匹配:`str_match()` 和 `str_match_all()`]] -- .code90[ ```r # 想从中sentences数据集中找出名词(使用启发式法则:在冠词a或the后面) noun <- "(a|the) +([^ ]+)" sentences %>% str_subset(noun) %>% head(2) -> has_noun ``` ] -- .pull-left.code85[ ```r has_noun %>% str_extract(noun) ``` ``` #> [1] "the smooth" "the sheet" ``` ] -- .pull-right.code85[ ```r has_noun %>% str_match(noun) ``` ``` #> [,1] [,2] [,3] #> [1,] "the smooth" "the" "smooth" #> [2,] "the sheet" "the" "sheet" ``` ] -- .code90[ ```r # 在tibble中使用tidyr::extract()可能更方便 tibble(sentence = sentences) %>% extract(sentence, c("article", "noun"), "(a|the) +([^ ]+)", remove = FALSE) ``` ``` #> # A tibble: 720 × 3 #> sentence article noun #> <chr> <chr> <chr> #> 1 The birch canoe slid on the smooth planks. the smooth #> 2 Glue the sheet to the dark blue background. the sheet #> 3 It's easy to tell the depth of a well. the depth #> # … with 717 more rows ``` ] --- .full-width[.content-box-blue.bold.font120[最后让我们看下 `stringr` 包中究竟有多少个以 `str_` 开头的函数]] .code85[ ```r *ls("package:stringr", pattern = "^str_") %>% matrix(ncol = 5, byrow = TRUE) -> strfuns strfuns[9, 3:5] <- "-" strfuns %>% as_tibble(.name_repair = "unique") %>% print(n = 9) ``` ``` #> # A tibble: 9 × 5 #> ...1 ...2 ...3 ...4 ...5 #> <chr> <chr> <chr> <chr> <chr> #> 1 str_c str_conv str_count str_detect str_dup #> 2 str_ends str_extract str_extract_all str_flatten str_glue #> 3 str_glue_data str_interp str_length str_locate str_locate_all #> 4 str_match str_match_all str_order str_pad str_remove #> 5 str_remove_all str_replace str_replace_all str_replace_na str_sort #> 6 str_split str_split_fixed str_squish str_starts str_sub #> 7 str_sub<- str_subset str_to_lower str_to_sentence str_to_title #> 8 str_to_upper str_trim str_trunc str_view str_view_all #> 9 str_which str_wrap - - - ``` ] --- layout: false class: inverse, center, middle background-image: url(imgs/logo-forcats.png), url(imgs/sxc.png) background-size: 10%, 100% background-position: 30% 40%, 0% 100% # 2. 因子 .font120[(.bold[`forcats`<sup>.font60[v0.5.2]</sup>]: Tools for Working with Categorical Variables (Factors))] --- layout: true ### >> 2.1 因子向量 --- .full-width[.content-box-blue.bold.font120[我们可以将字符串直接存入字符向量中 ...]] ```r x1 <- c("Excelence", "Good", "Average", "Pass", "Failed") ``` -- .full-width[.content-box-blue.bold.font120[... 但以字符方式存储数据可能面临两个问题]] .font120[ * 取值水平问题:尽管我们知道成绩等级水平只有5个,但字符方式无法保证你没有输入错误 ] ```r x2 <- c("Excelence", "Good", "Averege", "Pass", "Failed") # "Averege" 输入错了 ``` .font120[ * 排序问题:可能无法得到有意义的排序结果 <sup>.red[*]</sup> ] ```r sort(x1) ``` ``` #> [1] "Average" "Excelence" "Failed" "Good" "Pass" ``` .footnote.red.font100[*:而我们在制表、作图等场合通常需要保持合理的排序信息。] --- .pull-left[ .full-width[.content-box-blue.bold.font120[创建因子向量 ...]] ```r # 1. 设定合法的因子水平及其次序 grade_levels <- c( "Failed", "Pass", "Average", "Good", "Excelence" ) # 2. 将字符向量转变为因子向量 y1 <- factor(x1, levels = grade_levels) y1 ``` ``` #> [1] Excelence Good Average Pass Failed #> Levels: Failed Pass Average Good Excelence ``` ] -- .pull-right[ .full-width[.content-box-blue.bold.font120[... 有助于克服这两个问题]] ```r #1. 不合法的字符取值将变为`NA` y2 <- factor(x2, levels = grade_levels) y2 ``` ``` #> [1] Excelence Good <NA> Pass Failed #> Levels: Failed Pass Average Good Excelence ``` ```r #2. 排序正确 sort(y1) ``` ``` #> [1] Failed Pass Average Good Excelence #> Levels: Failed Pass Average Good Excelence ``` ] --- .full-width[.content-box-blue.bold.font120[其它说明]] * R 以整数存储因子向量 <br> <img src="imgs/factor-1.png" width="95%" style="display: block; margin: auto;" /> -- .pull-left[ * 若使用 `readr::parse_factor()`,出现不合法的输入时会给你提醒 ```r y2 <- parse_factor(x2, levels = grade_levels) ``` ``` #> Warning: 1 parsing failure. #> row col expected actual #> 3 -- value in level set Averege ``` ] -- .pull-right[ * 若不指定因子水平的 `levels` 参数,则以字符排序作为因子顺序 ```r factor(x1) as.integer(factor(x1)) # unclass(factor(x1)) ``` ``` #> [1] Excelence Good Average Pass Failed #> Levels: Average Excelence Failed Good Pass #> [1] 2 4 1 5 3 ``` ] --- layout: true ### >> 2.2 修改因子水平 --- .pull-left.code95[ ```r # 示例数据集 gss_cat ``` ``` #> # A tibble: 21,483 × 9 #> year marital age race #> <int> <fct> <int> <fct> #> 1 2000 Never married 26 White #> 2 2000 Divorced 48 White #> 3 2000 Widowed 67 White #> 4 2000 Never married 39 White #> 5 2000 Divorced 25 White #> 6 2000 Married 25 White #> 7 2000 Never married 36 White #> 8 2000 Divorced 44 White #> 9 2000 Married 44 White #> 10 2000 Married 47 White #> # … with 21,473 more rows, and 5 #> # more variables: rincome <fct>, #> # partyid <fct>, relig <fct>, #> # denom <fct>, tvhours <int> ``` ] -- .pull-right.code100[ ```r # 查看特定变量的因子水平及其计数 gss_cat %>% count(partyid) ``` ``` #> # A tibble: 10 × 2 #> partyid n #> <fct> <int> #> 1 No answer 154 #> 2 Don't know 1 #> 3 Other party 393 #> 4 Strong republican 2314 #> 5 Not str republican 3032 #> 6 Ind,near rep 1791 #> 7 Independent 4119 #> 8 Ind,near dem 2499 #> 9 Not str democrat 3690 #> 10 Strong democrat 3490 ``` ] --- .full-width[.content-box-blue.bold.font120[`fct_recode()`:手动改变因子水平]] -- .code95[ ```r gss_cat %>% mutate( partyid = fct_recode( partyid, "Republican, strong" = "Strong republican", "Republican, weak" = "Not str republican", "Independent, near rep" = "Ind,near rep", "Independent, near dem" = "Ind,near dem", "Democrat, weak" = "Not str democrat", "Democrat, strong" = "Strong democrat", * "Other" = "No answer", * "Other" = "Don't know", * "Other" = "Other party" # fct_recode()中未指定的因子水平 Independent 将保持原样 ) ) %>% count(partyid) ``` ``` #> # A tibble: 8 × 2 #> partyid n #> <fct> <int> #> 1 Other 548 #> 2 Republican, strong 2314 #> 3 Republican, weak 3032 #> # … with 5 more rows ``` ] --- .full-width[.content-box-blue.bold.font120[`fct_collapse()`:合并因子水平]] -- .code90[ ```r # fct_collapse()函数是fct_recode()函数的一种变体 gss_cat %>% mutate( partyid = fct_collapse( partyid, other = c("No answer", "Don't know", "Other party"), rep = c("Strong republican", "Not str republican"), ind = c("Ind,near rep", "Independent", "Ind,near dem"), dem = c("Not str democrat", "Strong democrat") # 左边满足R合法变量名的因子水平无需使用""(但要用""也行) # 因子水平的顺序以在`fct_collapse()`中出现的先后为序 ) ) %>% count(partyid) ``` ``` #> # A tibble: 4 × 2 #> partyid n #> <fct> <int> #> 1 other 548 #> 2 rep 5346 #> 3 ind 8409 #> 4 dem 7180 ``` ] --- .full-width[.content-box-blue.bold.font120[`fct_lump*()`:将最不常见/最常见的因子水平归并为 "Other"]] -- .pull-left.code90[ ```r gss_cat %>% count(relig, sort = TRUE) ``` ``` #> # A tibble: 15 × 2 #> relig n #> <fct> <int> #> 1 Protestant 10846 #> 2 Catholic 5124 #> 3 None 3523 #> 4 Christian 689 #> 5 Jewish 388 #> 6 Other 224 #> 7 Buddhism 147 #> 8 Inter-nondenominational 109 #> 9 Moslem/islam 104 #> 10 Orthodox-christian 95 #> 11 No answer 93 #> 12 Hinduism 71 #> 13 Other eastern 32 #> 14 Native american 23 #> 15 Don't know 15 ``` ] -- .pull-right.code90[ ```r gss_cat %>% mutate( relig = fct_lump_n( relig, * n = 5, # 负数归并最常见的水平 * other_level = "Other" # 默认 ) ) %>% count(relig, sort = TRUE) ``` ``` #> # A tibble: 6 × 2 #> relig n #> <fct> <int> #> 1 Protestant 10846 #> 2 Catholic 5124 #> 3 None 3523 *#> 4 Other 913 #> 5 Christian 689 #> 6 Jewish 388 ``` ] --- layout: true ### >> 2.3 调整因子次序 --- .full-width[.content-box-blue.bold.font120[有时默认的因子次序并不满足你的要求 ...]] -- .pull-left.code95[ ```r # 探索分析信仰不同宗教的人在 # 每日看电视时间方面的差异 relig_summary <- gss_cat %>% * group_by(relig) %>% summarise( age = mean(age, na.rm = TRUE), * tvhours = mean(tvhours, * na.rm = TRUE), n = n() ) # 作图 ggplot(relig_summary, aes(tvhours, relig)) + geom_point() ``` ] .pull-right[ <img src="L08_String_Factor_Datetime_files/figure-html/unnamed-chunk-59-1.png" width="100%" style="display: block; margin: auto;" /> - 很难从上图中看出特定模式并进行相应的分析,😢 ] --- .full-width[.content-box-blue.bold.font100[`fct_reorder()`:基于另一变量取值的排序来调整因子次序]] -- .code95[ ```r # fct_reorder(.f, .x, .fun = median, ..., .desc = FALSE) relig_summary %>% ggplot(aes(tvhours, fct_reorder(relig, tvhours))) + geom_point() ``` ] .pull-left[ <img src="L08_String_Factor_Datetime_files/figure-html/unnamed-chunk-60-1.png" width="100%" /> ] -- .pull-right.code95[ ```r # 左图的纵坐标标题有点奇怪 -> # 作图前先生成新变量可能是个更好的做法 relig_summary %>% * mutate( * relig = fct_reorder(relig, * tvhours) * ) %>% ggplot(aes(tvhours, relig)) + geom_point() ``` ] --- .full-width[.content-box-blue.bold[`fct_relevel()`:手动调整因子次序]] -- .code90[ ```r # fct_relevel(.f, ..., after = 0L) relig_summary %>% mutate(relig = fct_relevel(relig, "No answer", after = Inf)) %>% arrange(relig) ``` ] -- .full-width[.content-box-blue.bold[`fct_infreq() | fct_inorder() | fct_inseq()`:分别根据频数从大到小 | 在数据集中出现先后 | 数字顺序来调整因子次序]] -- .full-width[.content-box-blue.bold[`fct_rev()`:反转因子次序]] -- .pull-left.code90[ ```r gss_cat %>% * mutate(marital = * marital %>% * fct_infreq() %>% * fct_rev()) %>% ggplot(aes(marital)) + geom_bar() ``` ] .pull-right[ <img src="L08_String_Factor_Datetime_files/figure-html/unnamed-chunk-63-1.png" width="90%" style="display: block; margin: auto;" /> ] --- layout: false class: inverse, center, middle background-image: url(imgs/logo-lubridate.png), url(imgs/sxc.png) background-size: 10%, 100% background-position: 25% 40%, 0% 100% # 3. 日期-时间 .font120[(.bold[`lubridate`<sup>.font60[v1.9.0]</sup>]: Make Dealing with Dates a Little Easier)] --- layout: true ### >> 3.1 创建日期-时间 --- .full-width[.content-box-blue.bold.font120[`readr::parse_datetime() | parse_date() | parse_time()`]] -- .code85[ ```r parse_datetime(x, format = "", na = c("", "NA"), locale = default_locale()) # format: If unset, parsed as ISO8601 with formats specified in the locale(). ``` ] -- .pull-left.code85[ ```r - Year: "%Y"(4 digits); "%y"(2 digits) - Month: "%m"(2 digits) "%b"(abbr. name in locale()) "%B"(full name in locale()) - Day: "%d"(2 digits) "%e"(optional leading space) - Hour: "%H"(0-23) "%I"(0-12, used with "%p") - AM/PM indicator: "%p" - Minutes: "%M" - Seconds: "%S"(integer seconds) "%OS"(real seconds) ``` ] -- .pull-right.code85[ ```r - Time zone: "%Z"(name, e.g. "America/Chicago") "%z"(offset from UTC, e.g. "+0800") - Non-digits: "%." skips one non-digit "%+" skips one or more non-digits "%*" skips any number non-digits ``` ```r parse_datetime("2022-11-10 12:00") ``` ``` #> [1] "2022-11-10 12:00:00 UTC" ``` ```r parse_date("11/10/y21", "%m/%d/%.%y") ``` ``` #> [1] "2021-11-10" ``` ] --- .full-width[.content-box-blue.bold.font120[`lubridate::ymd_hms() | ymd() ...`]] -- .pull-left.code95[ ```r *library(lubridate) # 手动加载 ``` ```r ymd("2022-11-10") ``` ``` #> [1] "2022-11-10" ``` ```r # All are Date 2022-11-10 dmy("10-11, 2022") ymd(20221110) # unquoted numbers # Sys.getlocale("LC_TIME") ymd("2022十一月10日") # “日”会被忽略 ``` ] -- .pull-right.code95[ ```r # truncated formats ymd("2022", truncated = 2) ``` ``` #> [1] "2022-01-01" ``` ```r ymd_hms("2022-11-10 20:11:59") ``` ``` #> [1] "2022-11-10 20:11:59 UTC" ``` ```r mdy_hm("11-10 2022 !08:11 PM") ``` ``` #> [1] "2022-11-10 20:11:00 UTC" ``` ```r # <date> with tz -> <dttm> ymd(20221110, tz = "UTC") ``` ``` #> [1] "2022-11-10 UTC" ``` ] --- .full-width[.content-box-blue.bold.font120[`lubridate::make_datetime() | make_date()`]] -- .code90[ ```r (flights_dt <- nycflights13::flights %>% filter(!is.na(dep_time), !is.na(arr_time)) %>% mutate( dep_dttm = make_datetime( year, month, day, # year, month, day dep_time%/%100, dep_time%%100 # hour, min, sec = 0 ), arr_dttm = make_datetime( year, month, day, # year, month, day arr_time%/%100, arr_time%%100 # hour, min, sec = 0 ) ) %>% select(origin, dest, dep_time, arr_time, ends_with("_dttm"))) ``` ``` #> # A tibble: 328,063 × 6 #> origin dest dep_time arr_time dep_dttm arr_dttm #> <chr> <chr> <int> <int> <dttm> <dttm> #> 1 EWR IAH 517 830 2013-01-01 05:17:00 2013-01-01 08:30:00 #> 2 LGA IAH 533 850 2013-01-01 05:33:00 2013-01-01 08:50:00 #> 3 JFK MIA 542 923 2013-01-01 05:42:00 2013-01-01 09:23:00 #> # … with 328,060 more rows ``` ] --- .full-width[.content-box-blue.bold.font120[`lubridate::make_datetime() | make_date()`]] -- .code95[ ```r flights_dt %>% * filter(dep_dttm < ymd(20130102)) %>% # <dttm> 可和 <date> 直接比较大小 ggplot(aes(dep_dttm)) + geom_freqpoly(binwidth = 600) + # 600s = 10 minutes scale_x_datetime(date_labels = "%H:%M") # 设定坐标轴时间刻度标签的显示格式 ``` <img src="L08_String_Factor_Datetime_files/figure-html/unnamed-chunk-75-1.png" width="55%" style="display: block; margin: auto;" /> ] --- .full-width[.content-box-blue.bold.font120[`lubridate::as_datetime() | as_date()`]] -- .pull-left.code100[ ```r # S4 method for # signature 'POSIXt' and 'Date' as_datetime(today()) ``` ``` #> [1] "2022-11-17 UTC" ``` ```r as_date(now()) ``` ``` #> [1] "2022-11-17" ``` ] -- .pull-right.code90[ ```r # S4 method for # signature 'numeric' # 默认参数origin = lubridate::origin # tz = "UTC" as_datetime(60 * 60 * 10) ``` ``` #> [1] "1970-01-01 10:00:00 UTC" ``` ```r # 默认参数origin = lubridate::origin as_date(365 * 10 + 2) ``` ``` #> [1] "1980-01-01" ``` ```r # S4 method for # signature 'character' as_datetime("2019/10/14 10:00", format = "%Y/%m/%d %H:%M") ``` ``` #> [1] "2019-10-14 10:00:00 UTC" ``` ] --- layout: true ### >> 3.2 处理时间成分 --- .full-width[.content-box-blue.bold.font120[`lubridate::year() | month() ...`:提取|更改时间成分]] -- .pull-left.code80[ ```r datetime <- ymd_hm("2022/11/10 10:00") # 提取时间成分 # ------------------------------- month(datetime, label = TRUE) ``` ``` #> [1] Nov #> Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < Nov < Dec ``` ```r mday(datetime) ``` ``` #> [1] 10 ``` ```r yday(datetime) ``` ``` #> [1] 314 ``` ```r wday(datetime, label = TRUE) ``` ``` #> [1] Thu #> Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat ``` ] -- .pull-right.code85[ ```r # 更改时间成分 # ------------------------------- # 就地修改datetime year(datetime) <- 2020 datetime ``` ``` #> [1] "2020-11-10 10:00:00 UTC" ``` ```r wday(datetime) <- wday(datetime) + 1 datetime ``` ``` #> [1] "2020-11-11 10:00:00 UTC" ``` ```r # 同时更改多个成分并生成新的日期-时间 # 原来的datetime保持不变 update(datetime, year = 2020, month = 2, hour = 2) ``` ``` #> [1] "2020-02-11 02:00:00 UTC" ``` ] --- .full-width[.content-box-blue.bold.font120[`lubridate::year() | month() ...`:提取|更改时间成分]] ```r flights_dt %>% * mutate(wday = wday(dep_dttm, label = TRUE)) %>% ggplot(aes(x = wday)) + geom_bar() ``` <img src="L08_String_Factor_Datetime_files/figure-html/unnamed-chunk-80-1.png" width="52%" style="display: block; margin: auto;" /> --- .full-width[.content-box-blue.bold.font120[`lubridate::floor_date() ...`:时间成分取整]] -- ```r # floor_date(x, unit = "second", week_start = getOption("lubridate.week.start", 7)) # unit: second, minute, hour, day, week, month, year, # bimonth, quarter, season, halfyear; 3 months, etc. flights_dt %>% * count(week = floor_date(dep_dttm, unit = "week")) %>% ggplot(aes(week, n)) + geom_line() + scale_x_datetime(date_labels = "%YW%U") # ?strptime ``` <img src="L08_String_Factor_Datetime_files/figure-html/unnamed-chunk-81-1.png" width="50%" style="display: block; margin: auto;" /> --- layout: true ### >> 3.3 时间间隔(time spans) --- `lubridate`包提供三种记录/处理时间间隔的方式: * _`duration`_:用秒精确记录/处理,`dseconds() | dhours() ...` * _`period`_:用人们熟悉的方式记录/处理,`seconds() | hours() ...` * _`interval`_:记录起点-终点之间的时间间隔,`%--%` -- .pull-left.code80[ ```r # Durations # --------------------------------------- dyears(1) ``` ``` #> [1] "31557600s (~1 years)" ``` ```r # Periods # --------------------------------------- years(1) ``` ``` #> [1] "1y 0m 0d 0H 0M 0S" ``` ```r # Intervals # --------------------------------------- today() %--% (today() + years(1)) ``` ``` #> [1] 2022-11-17 UTC--2023-11-17 UTC ``` ] -- .pull-right.code80[ ```r # A leap year ymd("2016-01-01") + dyears(1) # ?! ``` ``` #> [1] "2016-12-31 06:00:00 UTC" ``` ```r ymd("2016-01-01") + years(1) ``` ``` #> [1] "2017-01-01" ``` ```r # Arithmetic operations y1 <- ymd("2016-01-01") y2 <- y1 + years(1) (y1 %--% y2) / ddays(1) ``` ``` #> [1] 366 ``` ```r (y1 %--% y2) %/% days(1) ``` ``` #> [1] 366 ``` ] --- .full-width[.content-box-blue.bold.font120[不同数据类型之间允许的运算]] <img src="imgs/datetimes-arithmetic.png" width="65%" style="display: block; margin: auto;" /> -- .code90[ ```r # 示例 flights_dt <- flights_dt %>% mutate( overnight = arr_dttm < dep_dttm, arr_dttm = arr_dttm + days(overnight * 1), ) # Now all flights obey the laws of physics flights_dt %>% filter(overnight, arr_dttm < dep_dttm) ``` ``` #> # A tibble: 0 × 7 #> # … with 7 variables: origin <chr>, dest <chr>, dep_time <int>, arr_time <int>, #> # dep_dttm <dttm>, arr_dttm <dttm>, overnight <lgl> ``` ] --- layout: false class: inverse, center, middle # 课后作业 --- class: middle ### 课后作业 .Large[ 1. 根据本讲讲义的打印稿,在 📑 _Rmd_ 中键入并完成代码的运行 1. **复习** 📖 [_R for Data Science_](https://r4ds.had.co.nz/) 一书关于 [字符串](https://r4ds.had.co.nz/strings.html)、[因子](https://r4ds.had.co.nz/factors.html) 和 [日期-时间](https://r4ds.had.co.nz/dates-and-times.html) 等三种重要数据类型以及相应处理函数的内容(为中文翻译版的第10章、第11章和第12章) 1. 下载(打印)📰 [.bold[{{`stringr`的cheatsheet}}]](https://posit.co/wp-content/uploads/2022/10/strings.pdf)、[.bold[{{`forcats`的cheatsheet}}]](https://posit.co/wp-content/uploads/2022/10/factors.pdf) 和 [.bold[{{`lubridate`的cheatsheet}}]](https://posit.co/wp-content/uploads/2022/10/lubridate.pdf) 并阅读之 1. 完成第八讲的课后练习 👩💻 > .code100[ ```r remotes::install_github("qfwr2022/qfwr") library(qfwr) qfwr_ex("L08") ``` ] ] --- class: center, middle, hide_logo background-image: url(imgs/xaringan.png) background-size: 12% background-position: 50% 40% <br><br><br><br><br><br><br> <hr color='#f00' size='2px' width='80%'> <br> .Large.red[_**本网页版讲义的制作由 R 包 [{{`xaringan`}}](https://github.com/yihui/xaringan) 赋能!**_]