class: center, middle, inverse, title-slide .title[ # 量化金融与金融编程 ] .subtitle[ ## L06 数据导入与齐整 ] .author[ ###
曾永艺 ] .institute[ ### 厦门大学管理学院 ] .date[ ###
2022-11-11 ] --- class: middle background-image: url(imgs/readr-tidyr.png) background-size: 18% background-position: 85% 50%
> 问渠那得清如许?为有源头活水来。——[宋]朱熹 -- > Like families, tidy datasets are all alike but every messy <br> > dataset is messy in its own way. ——Hadley Wickham <br> -- .bold.font150[1\. 数据导入] * .font120[读入矩形文本数据:`readr` 包 <sup>.font80[2.1.3]</sup>] * .font120[读入其它类型的数据] -- .bold.font150[2\. 数据齐整] * .font120[齐整数据] * .font120[`tidyr` 包 <sup>.font80[1.2.1]</sup>] * .font120[`pivot_longer()` 和 `pivot_wider()`] * .font120[`separate()` 和 `unite()`] * .font120[复杂案例] * .font120[`unnest_*()` 和 `hoist()`] --- class: inverse, center, middle # 1. 数据导入 .font150[(Data Import)] --- layout: true ### >> 读入文本文档:`readr` 包 --- .full-width[.content-box-blue.bold.font120[☑ 一致的函数命名方式]] ```r read_csv() | read_csv2() | read_tsv() | read_delim() | read_table() | read_fwf() | read_lines() | read_lines_raw() | read_file() | read_file_raw() | read_log() | ``` -- .full-width[.content-box-blue.bold.font120[☑ 基本一致的参数设置]] ```r read_*(file, # delim = NULL, escape_backslash = FALSE, escape_double = TRUE, col_names = TRUE, col_types = NULL, col_select = NULL, id = NULL, locale = default_locale(), na = c("", "NA"), quoted_na = TRUE, quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), name_repair = "unique", num_threads = readr_threads(), progress = show_progress(), show_col_types = should_show_types(), skip_empty_rows = TRUE, lazy = should_read_lazy()) # file: Either a path to a file, a connection, or literal data. # Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed. # Files starting with http://, https://, ftp://, or ftps:// will be downloaded. # Using a value of clipboard() will read from the system clipboard. ``` --- .full-width[.content-box-blue.bold.font120[☑ 和内置 `utils` 包中 `read.*()` 比较]] -- .font110[ * 生成 `tibble`.red[<sup>*</sup>](更确切地说应该是 `spec_tbl_df`) * 默认情况下做得更少,如不会自动将字符型变量转换为因子型变量(`read.*()`有个`stringsAsFactors`参数)、不会自动更改列名、不会将一列数据转化为行名等 * 较少依赖系统和环境变量,结果更具可重现性(_reproducible_) * 读入速度通常更快(~50x),且读入大文件时有进度条提示 ] -- ``` ?`tbl_df-class` tibble: a subclass of data.frame and the central data structure for the tidyverse lazy and surly - do less and complain more than base data.frames. ====================================================================================== 1. 列数据不进行自动类型转换(如字符->因子),且原生直接支持列表列 2. 只对长度为1的“标量”向量进行循环操作 3. tibble支持不合法的数据变量名,如 tibble(`:)` = "smile") 4. 不增加行名,也不鼓励使用行名来存储数据信息 5. 用 [ 对tibble取子集操作总是返回新的tibble,[[ 和 $ 总是返回向量,且 $ 不支持变量名部分匹配 6. tibble在屏幕上打印输出时会适应屏幕大小,并提供更多有用信息(类似str()) ``` --- .full-width[.content-box-blue.bold.font120[常用参数示例:`read_csv()`]] -- .pull-left.code80[ ```r library(tidyverse) read_csv("The 1st line of metadata The 2nd line of metadata # A comment to skip x,y,z 1,2,3", skip = 2, comment = "#") ``` ``` #> Rows: 1 Columns: 3 #> ── Column specification ──────────────── #> Delimiter: "," #> dbl (3): x, y, z #> #> ℹ Use `spec()` to retrieve the full column specification for this data. #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ``` #> # A tibble: 1 × 3 #> x y z #> <dbl> <dbl> <dbl> #> 1 1 2 3 ``` ] -- .pull-right.code80[ ```r read_csv("1,2,3\n4,5,6") ``` ``` #> # A tibble: 1 × 3 #> `1` `2` `3` #> <dbl> <dbl> <dbl> #> 1 4 5 6 ``` ```r read_csv("1,2,3\n4,5,6", col_names = FALSE) ``` ``` #> # A tibble: 2 × 3 #> X1 X2 X3 #> <dbl> <dbl> <dbl> #> 1 1 2 3 #> 2 4 5 6 ``` ```r read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z")) ``` ``` #> # A tibble: 2 × 3 #> x y z #> <dbl> <dbl> <dbl> #> 1 1 2 3 #> 2 4 5 6 ``` ] --- .full-width[.content-box-blue.bold.font120[`readr` 包采用启发式策略来解析文本文档中数据]] ```r # 生成示例数据文档 set.seed(123456) # 设定随机数种子 nycflights13::flights %>% mutate(dep_delay = if_else(dep_delay <= 0, FALSE, TRUE)) %>% select(month:dep_time, dep_delay, tailnum, time_hour) %>% slice_sample(n = 20) %>% mutate(across(everything(), ~ifelse(runif(20) <= 0.1, NA, .))) %>% mutate(time_hour = lubridate::as_datetime(time_hour)) %>% * print(n = 5) %>% write_csv(file = "data/ex_flights.csv", na = "--") # 默认na = "NA" ``` ``` #> # A tibble: 20 × 6 #> month day dep_time dep_delay tailnum time_hour #> <int> <int> <int> <lgl> <chr> <dttm> #> 1 6 26 932 TRUE <NA> 2013-06-26 12:00:00 #> 2 NA 5 NA NA <NA> 2013-12-05 17:00:00 #> 3 7 20 656 FALSE N17245 2013-07-20 10:00:00 #> 4 5 16 NA FALSE N27152 2013-05-16 12:00:00 #> 5 12 23 NA NA <NA> 2013-12-23 23:00:00 #> # … with 15 more rows ``` --- .full-width[.content-box-blue.bold.font120[`readr` 包采用启发式策略来解析文本文档中数据]] ```r (ex_flights <- read_csv("data/ex_flights.csv")) # 默认na = c("", "NA") ``` ``` #> Rows: 20 Columns: 6 #> ── Column specification ────────────────────────────────────────────────────────────────── #> Delimiter: "," #> chr (6): month, day, dep_time, dep_delay, tailnum, time_hour #> #> ℹ Use `spec()` to retrieve the full column specification for this data. #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ``` #> # A tibble: 20 × 6 #> month day dep_time dep_delay tailnum time_hour #> <chr> <chr> <chr> <chr> <chr> <chr> #> 1 6 26 932 TRUE -- 2013-06-26T12:00:00Z #> 2 -- 5 -- -- -- 2013-12-05T17:00:00Z #> 3 7 20 656 FALSE N17245 2013-07-20T10:00:00Z #> 4 5 16 -- FALSE N27152 2013-05-16T12:00:00Z #> 5 12 23 -- -- -- 2013-12-23T23:00:00Z #> # … with 15 more rows ``` --- .full-width[.content-box-blue.bold.font120[`readr` 包采用启发式策略来解析文本文档中数据]] > .font120[1\. 根据给定或猜测的分隔符将文本文档分割为字符串矩阵] > .font120[2\. 确定每列字符串向量的类型] > 2\.1 由 `col_types` 参数直接给定 > 2\.2 猜测:读入文档 `guess_max` 行(v2 vs. v1),并按如下顺序猜测每列变量的类型:*logical -> ~~integer ->~~ double -> ~~number ->~~ time -> date -> date-time -> character * > .font120[3\. 将每列字符串解析为相应类型的向量] -- .full-width[.content-box-blue.bold.font120[当发现 `readr` 自动解析碰到问题时]] * 可用 `problems()` 查看读入时遇到的问题 * 根据需要设定 `read_*()` 的相关参数,如 `na = "--"` * 直接设定 `read_*()` 的参数 `col_types = cols(...)` * 将全部数据读入为字符型变量,然后再用 `type_convert()` 转换变量类型 * 在后续的数据整理步骤中进行相应的处理(如 `dplyr::muate() + parse_*()`) --- .pull-left.code75[ ```r # 设定缺失值参数 read_csv("data/ex_flights.csv", na = "--") %>% spec() ``` ``` #> cols( #> month = col_double(), #> day = col_double(), #> dep_time = col_double(), #> dep_delay = col_logical(), #> tailnum = col_character(), #> time_hour = col_datetime(format = "") #> ) ``` ```r # 直接设定参数col_types,?cols ex_flights <- read_csv( "data/ex_flights.csv", col_types = cols( # cols_only() dep_delay = col_logical(), # "l" tailnum = col_character(), # "c" time_hour = col_datetime(), # "T" .default = col_integer() ) # col_types = "iiilcT" ) ``` ``` #> Warning: One or more parsing issues, call `problems()` on #> your data frame for details, e.g.: #> dat <- vroom(...) #> problems(dat) ``` ] -- .pull-right.code80[ ```r problems(ex_flights) ``` ``` #> # A tibble: 17 × 5 #> row col expected actual file #> <int> <int> <chr> <chr> <chr> #> 1 3 1 an integer -- C:/Users/admin/Desktop/QFwR_2022.09/slides/L06_Im… #> 2 3 3 an integer -- C:/Users/admin/Desktop/QFwR_2022.09/slides/L06_Im… #> 3 3 4 1/0/T/F/TRUE/FALSE -- C:/Users/admin/Desktop/QFwR_2022.09/slides/L06_Im… #> # … with 14 more rows ``` ```r # 全部读入,再行转换类型 read_csv( "data/ex_flights.csv", col_types = cols( * .default = col_character() ) ) %>% * type_convert(col_types = "iiilcT") ``` ``` #> # A tibble: 20 × 6 #> month day dep_time dep_delay tailnum time_hour #> <int> <int> <int> <lgl> <chr> <dttm> #> 1 6 26 932 TRUE -- 2013-06-26 12:00:00 #> 2 NA 5 NA NA -- 2013-12-05 17:00:00 #> 3 7 20 656 FALSE N17245 2013-07-20 10:00:00 #> # … with 17 more rows ``` ] --- .full-width[.content-box-blue.bold.font120[一次性读入多个相同性质的文档]] ```r # 生成相同性质的示例文档 nycflights13::flights %>% split(.$carrier) %>% purrr::iwalk(~ write_tsv(.x, glue::glue("data/flights_{.y}.tsv"))) ``` -- ```r # 文件路径向量 (files <- dir(path = "data/", pattern = "\\.tsv$", full.names = TRUE)) ``` ``` #> [1] "data/flights_9E.tsv" "data/flights_AA.tsv" "data/flights_AS.tsv" #> [4] "data/flights_B6.tsv" "data/flights_DL.tsv" "data/flights_EV.tsv" #> [7] "data/flights_F9.tsv" "data/flights_FL.tsv" "data/flights_HA.tsv" #> [10] "data/flights_MQ.tsv" "data/flights_OO.tsv" "data/flights_UA.tsv" #> [13] "data/flights_US.tsv" "data/flights_VX.tsv" "data/flights_WN.tsv" #> [16] "data/flights_YV.tsv" ``` ```r # 一次性读入(并纵向合并为 tibble) read_tsv(files, id = "fpath") # 这里通过设置 id 参数将文件路径存入指定变量fpath中 ``` ``` #> # A tibble: 336,776 × 20 #> fpath year month day dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶ carrier flight #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> #> 1 data/f… 2013 1 1 810 810 0 1048 1037 11 9E 3538 #> 2 data/f… 2013 1 1 1451 1500 -9 1634 1636 -2 9E 4105 #> 3 data/f… 2013 1 1 1452 1455 -3 1637 1639 -2 9E 3295 #> # … with 336,773 more rows, 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and #> # abbreviated variable names ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time, #> # ⁵sched_arr_time, ⁶arr_delay ``` --- .full-width[.content-box-blue.bold.font120[`readr` 的图形化用户界面 <sup>.font80.red[*]</sup>]] <img src="imgs/readr_UI.png" width="88%" style="display: block; margin: auto;" /> .footnote.red.font80[\* `RStudio` 右上 `Environment` 标签页 > `Import Dataset` > `From Text (readr)` ...] --- layout: true ### >> 读入其它类型的数据 --- -- .font110[ * .bold[`readr::read_rds()`]: `R` data files (*.rds*) ] -- .pull-left.font110[ *  [~~.bold[{{`vroom`}}~~]](https://vroom.r-lib.org/): Read and Write Rectangular Text Data **Quickly** *  [.bold[{{`readxl`}}]](https://readxl.tidyverse.org/): excel files (_.xls_ and _.xlsx_) <- take a 🧐 at it! *  [.bold[{{`haven`}}]](https://haven.tidyverse.org/): SPSS, Stata, and SAS files *  [.bold[{{`arrow`}}]](https://arrow.apache.org/docs/r/): Apache Arrow ] -- .pull-right.font110[ * [.bold[{{`DBI`}}]](https://dbi.r-dbi.org/) with specific interface packages: databases, such as SQLite, MySQL and PostgreSQL * .bold[`jsonlite`]: JSON files * [.bold[{{`xml2`}}]](https://xml2.r-lib.org/): XML and HTML files *  [.bold[{{`rvest`}}]](https://rvest.tidyverse.org): HTML (Web Scraping) * [.bold[{{`httr2`}}]](https://httr2.r-lib.org/): Web APIs ] -- <br> .font110[ * .bold[[{{其它}}](https://cran.r-project.org/web/views/)]: text、network、spatial、genome、image 等类型的数据 ] --- layout: false class: inverse, center, middle # 2. 数据齐整 .font150[(Tidy Data)] --- layout: true ### >> 齐整数据 --- .full-width[.content-box-blue.bold.font120[齐整数据的三个条件 📐]] .font120[ * 每列都是一个变量(Every column is a variable.) * 每行都是一个观测(Every row is an observation.) * 每格都是一个取值(Every cell is a single value.) ] -- <img src="imgs/tidy-1.png" width="85%" style="display: block; margin: auto;" /> --- .full-width[.content-box-blue.bold.font120[齐整数据的好处 💪]] .font120[ ☑ 齐整数据按照逻辑一致的方式存储数据,这让你更容易学习并掌握相关工具对数据进行处理 ☑ “每列都是一个变量”及“每行都是一个观测”有助于发挥 R 语言向量化操作的优势 ] <img src="imgs/tidy-2.png" width="80%" style="display: block; margin: auto;" /> .font120[ ☑ `tidyverse` 中的 R 包(如 `dplyr` 等)在设计上要求输入数据为齐整数据 ] --- .full-width[.content-box-blue.bold.font120[但……并非所有的数据集都是齐整的,😭]] .pull-left.code80[ ```r table1 ``` ``` #> # A tibble: 6 × 4 #> country year cases population #> <chr> <int> <int> <int> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> # … with 3 more rows ``` ```r table2 ``` ``` #> # A tibble: 12 × 4 #> country year type count #> <chr> <int> <chr> <int> #> 1 Afghanistan 1999 cases 745 #> 2 Afghanistan 1999 population 19987071 #> 3 Afghanistan 2000 cases 2666 #> # … with 9 more rows ``` ] .pull-right.code70[ ```r table3 ``` ``` #> # A tibble: 6 × 3 #> country year rate #> * <chr> <int> <chr> #> 1 Afghanistan 1999 745/19987071 #> 2 Afghanistan 2000 2666/20595360 #> 3 Brazil 1999 37737/172006362 #> # … with 3 more rows ``` ```r table4a # cases ``` ``` #> # A tibble: 3 × 3 #> country `1999` `2000` #> * <chr> <int> <int> #> 1 Afghanistan 745 2666 #> 2 Brazil 37737 80488 #> 3 China 212258 213766 ``` ```r table4b # population ``` ``` #> # A tibble: 3 × 3 #> country `1999` `2000` #> * <chr> <int> <int> #> 1 Afghanistan 19987071 20595360 #> 2 Brazil 172006362 174504898 #> 3 China 1272915272 1280428583 ``` ] --- layout: true ### >> 该 `tidyr` 包出场啦 🎉 --- .full-width[.content-box-blue.bold.font120[`tidyr`包中的函数大致可分为5大类:]] .pull-left.font105[ 1. 数据长型-宽型转换(_pivoting_):`pivot_longer()` 和 `pivot_wider()` 2. 分解或合并字符型变量:`separate()`、`separate_rows()`、`extract()` 和 `unite()` 3. 将深度嵌套的列表数据表格化(_rectangling_):`unnest_longer()`、`unnest_wider()`、`unnest_auto()` 和 `hoist()` 4. 将分组数据框转化成嵌套的数据框(每组占一行)或反之: `nest()` 和 `unnest()` 5. **缺失值处理:->** ] -- .pull-right.font80[ - `complete(data, ..., fill = list())`: Complete a data frame with missing combinations of data - `expand(data, ...)`, `crossing(...)`, `nesting(...)`: Expand data frame to include all combinations of values - `expand_grid(...)`: Create a tibble from all combinations of the inputted name-value pairs - `full_seq(x, period, tol = 1e-06)`: Create the full sequence of values in a vector - `drop_na(data, ...)`: Drop rows containing missing values - `fill(data, ..., .direction = c("down", "up", "downup", "updown"))`: Fill in missing values with previous or next value - `replace_na(data, replace, ...)`: Replace missing values with specified values ] --- layout: true ### >> **.red[`pivot_longer()`]** 和 `pivot_wider()` --- <img src="imgs/tidy-3.png" width="75%" style="display: block; margin: auto;" /> -- .pull-left.code80[ ```r pivot_longer( * data, cols, * names_to = "name", names_prefix = NULL, names_sep = NULL, names_pattern = NULL, names_ptypes = NULL, names_transform = NULL, names_repair = "check_unique", * values_to = "value", values_drop_na = FALSE, values_ptypes = NULL, values_transform = NULL, ... ) ``` ] -- .pull-right.code80[ ```r table4a %>% pivot_longer( cols = c(`1999`, `2000`), names_to = "year", values_to = "cases" ) # table4b %>% ... ``` ``` #> # A tibble: 6 × 3 #> country year cases #> <chr> <chr> <int> #> 1 Afghanistan 1999 745 #> 2 Afghanistan 2000 2666 #> 3 Brazil 1999 37737 #> # … with 3 more rows ``` ] --- layout: true ### >> `pivot_longer()` 和 **.red[`pivot_wider()`]** --- <img src="imgs/tidy-4.png" width="75%" style="display: block; margin: auto;" /> -- .pull-left.code80[ ```r pivot_wider( * data, id_cols = NULL, id_expand = FALSE, * names_from = name, names_prefix = "", names_sep = "_", names_glue = NULL, names_sort = FALSE, names_vary = "fastest", names_expand = FALSE, names_repair = "check_unique", * values_from = value, values_fill = NULL, values_fn = NULL, unused_fn = NULL, ... ) ``` ] -- .pull-right.code80[ ```r table2 %>% pivot_wider( names_from = type, values_from = count ) ``` ``` #> # A tibble: 6 × 4 #> country year cases population #> <chr> <int> <int> <int> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> # … with 3 more rows ``` ] --- layout: true ### >> **.red[`separate()`、`extract()`]** 和 `unite()` --- <img src="imgs/tidy-5.png" width="55%" style="display: block; margin: auto;" /> -- .pull-left.code80[ ```r separate( * data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ... ) extract( * data, col, into, regex = "([[:alnum:]]+)", remove = TRUE, convert = FALSE, ... ) separate_rows(data, ..., sep, convert) ``` ] -- .pull-right.code80[ ```r table3 %>% separate( rate, into = c("cases", "population"), convert = TRUE ) ``` ``` #> # A tibble: 6 × 4 #> country year cases population #> <chr> <int> <int> <int> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> # … with 3 more rows ``` ] --- layout: true ### >> `separate()`、`extract()` 和 **.red[`unite()`]** --- <img src="imgs/tidy-6.png" width="80%" style="display: block; margin: auto;" /> -- .pull-left.code100[ ```r unite( data, col, ..., sep = "_", remove = TRUE, na.rm = FALSE ) ``` ] -- .pull-right.code90[ ```r table5 %>% unite( col = "year", century, year, sep = "" ) ``` ``` #> # A tibble: 6 × 3 #> country year rate #> <chr> <chr> <chr> #> 1 Afghanistan 1999 745/19987071 #> 2 Afghanistan 2000 2666/20595360 #> 3 Brazil 1999 37737/172006362 #> # … with 3 more rows ``` ] --- layout: true ### >> 复杂例子 --- ```r who # ?who ``` ``` #> # A tibble: 7,240 × 60 #> country iso2 iso3 year new_sp_…¹ new_s…² new_s…³ new_s…⁴ new_s…⁵ new_s…⁶ new_s…⁷ #> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> #> 1 Afghanistan AF AFG 1980 NA NA NA NA NA NA NA #> 2 Afghanistan AF AFG 1981 NA NA NA NA NA NA NA #> 3 Afghanistan AF AFG 1982 NA NA NA NA NA NA NA #> 4 Afghanistan AF AFG 1983 NA NA NA NA NA NA NA #> 5 Afghanistan AF AFG 1984 NA NA NA NA NA NA NA #> 6 Afghanistan AF AFG 1985 NA NA NA NA NA NA NA #> 7 Afghanistan AF AFG 1986 NA NA NA NA NA NA NA #> 8 Afghanistan AF AFG 1987 NA NA NA NA NA NA NA #> 9 Afghanistan AF AFG 1988 NA NA NA NA NA NA NA #> 10 Afghanistan AF AFG 1989 NA NA NA NA NA NA NA #> # … with 7,230 more rows, 49 more variables: new_sp_f014 <int>, new_sp_f1524 <int>, #> # new_sp_f2534 <int>, new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>, #> # new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>, #> # new_sn_m3544 <int>, new_sn_m4554 <int>, new_sn_m5564 <int>, new_sn_m65 <int>, #> # new_sn_f014 <int>, new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>, #> # new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>, new_ep_m014 <int>, #> # new_ep_m1524 <int>, new_ep_m2534 <int>, new_ep_m3544 <int>, new_ep_m4554 <int>, … ``` --- .code80[ ```r who1 <- who %>% pivot_longer( cols = new_sp_m014:newrel_f65, names_to = "name", values_to = "cases", values_drop_na = TRUE ) who1 ``` ``` #> # A tibble: 76,046 × 6 #> country iso2 iso3 year name cases #> <chr> <chr> <chr> <int> <chr> <int> #> 1 Afghanistan AF AFG 1997 new_sp_m014 0 #> 2 Afghanistan AF AFG 1997 new_sp_m1524 10 #> 3 Afghanistan AF AFG 1997 new_sp_m2534 6 #> # … with 76,043 more rows ``` ] -- > 变量 `name` 中的字符串代表什么?通过查阅 `who` 的数据字典得到如下信息: > 1. 前3个字母代表是否是“新增”肺结核病例,在数据中均为`new` > 2. 接下来2-3个字母代表肺结核的类型(`sp`、`sn`、`ep`、`rel`等) > 3. 接下来的字母`m`或`f`代表肺结核病人的性别(男或女) > 4. 余下数字代表年龄组,如 `014 = 0 - 14岁 | 1524 = 15 - 24岁` --- .code80[ ```r who2 <- who1 %>% mutate(name = stringr::str_replace(name, "newrel", "new_rel")) %>% separate(name, c(NA, "type", "sexage"), sep = "_") # 基于分隔符 who2 ``` ``` #> # A tibble: 76,046 × 7 #> country iso2 iso3 year type sexage cases #> <chr> <chr> <chr> <int> <chr> <chr> <int> #> 1 Afghanistan AF AFG 1997 sp m014 0 #> 2 Afghanistan AF AFG 1997 sp m1524 10 #> 3 Afghanistan AF AFG 1997 sp m2534 6 #> # … with 76,043 more rows ``` ] -- .code80[ ```r who3 <- who2 %>% separate(sexage, c("sex", "age"), sep = 1) # 基于位置 who3 ``` ``` #> # A tibble: 76,046 × 8 #> country iso2 iso3 year type sex age cases #> <chr> <chr> <chr> <int> <chr> <chr> <chr> <int> #> 1 Afghanistan AF AFG 1997 sp m 014 0 #> 2 Afghanistan AF AFG 1997 sp m 1524 10 #> 3 Afghanistan AF AFG 1997 sp m 2534 6 #> # … with 76,043 more rows ``` ] --- .code80[ ```r *# all in a pipe! who %>% pivot_longer(cols = new_sp_m014:newrel_f65, names_to = "name", values_to = "cases", values_drop_na = TRUE) %>% mutate(name = stringr::str_replace(name, "newrel", "new_rel")) %>% separate(name, c(NA, "type", "sexage")) %>% separate(sexage, c("sex", "age"), sep = 1) ``` ] -- .pull-left.code80[ ```r *# all with powerful pivot_longer()! who %>% pivot_longer( cols = starts_with("new"), * names_to = c("type", "sex", "age"), * names_pattern = "new_?(.*)_(.)(.*)", * names_ptypes = list( * sex = factor(levels = c("f", "m")), * age = factor( * levels = c("014", "1524", "2534", * "3544", "4554", "5564", "65"), * ordered = TRUE)), values_to = "cases", values_drop_na = TRUE ) ``` ] .pull-right.code80[ ``` #> # A tibble: 76,046 × 8 #> country iso2 iso3 year type sex #> <chr> <chr> <chr> <int> <chr> <fct> #> 1 Afghani… AF AFG 1997 sp m #> 2 Afghani… AF AFG 1997 sp m #> 3 Afghani… AF AFG 1997 sp m #> 4 Afghani… AF AFG 1997 sp m #> 5 Afghani… AF AFG 1997 sp m #> # … with 76,041 more rows, and 2 more #> # variables: age <ord>, cases <int> ``` <br> .font120.bold[👉 `vignette("pivot")`] ] --- layout: false class: hide_logo ## 🙋♂️ Your Turn! <style type="text/css"> #special_timer.running { background-color: black; background-image: url(imgs/bg-stars.gif); } #special_timer.finished { background-color: black; background-image: url(imgs/bg-sqfw.gif); background-size: cover; } #special_timer.running .countdown-digits { color: #fdf6e3; } #special_timer.finished .countdown-digits { color: #fdf6e3; } </style>
−
+
05
:
00
.panelset[ .panel[.panel-name[问题] .font120[在空白处填入恰当的代码将数据集 billboard 转化为目标结构:] .pull-left.code100[ <code class ='r hljs remark-code'>library(tidyverse)<br>billboard %>% View()<br>billboard %>%<br> pivot_<span style='background-color:#ffff7f'> </span>(<br> cols = <span style='background-color:#ffff7f'> </span>,<br> names_to = <span style='background-color:#ffff7f'> </span>,<br> <span style='background-color:#ffff7f'> </span> ,<br> names_transform = <span style='background-color:#ffff7f'> </span>,<br> values_to = <span style='background-color:#ffff7f'> </span>,<br> values_drop_na = TRUE<br> )</code> ] .pull-right.code80[ ``` *# A tibble: 317 × 79 artist track date.ent…¹ wk1 wk2 <chr> <chr> <date> <dbl> <dbl> 1 2 Pac Baby… 2000-02-26 87 82 2 2Ge+her The … 2000-09-02 91 87 3 3 Doors … Kryp… 2000-04-08 81 70 # … with 314 more rows, 74 more # variables: wk3 <dbl>, wk4 <dbl>, # wk5 <dbl>, wk6 <dbl>, wk7 <dbl>, # wk8 <dbl>, wk9 <dbl>, wk10 <dbl>, # wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, # wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, # wk17 <dbl>, wk18 <dbl>, … ``` ``` *# A tibble: 5,307 × 5 artist track date.ent…¹ week rank <chr> <chr> <date> <int> <dbl> 1 2 Pac Baby Do… 2000-02-26 1 87 2 2 Pac Baby Do… 2000-02-26 2 82 3 2 Pac Baby Do… 2000-02-26 3 72 # … with 5,304 more rows, and # abbreviated variable name # ¹date.entered ``` ] ] .panel[.panel-name[参考答案] .code110[ ```r library(tidyverse) billboard %>% View() billboard %>% pivot_longer( cols = starts_with("wk") , names_to = "week" , names_prefix = "wk" , names_transform = list(week = as.integer), values_to = "rank" , values_drop_na = TRUE ) ``` ] ] ] --- layout: true ### >> `unnest_*` 和 `hoist()` --- .pull-left.code90[ ```r repurrrsive::gh_repos %>% View() ``` <img src="imgs/repos.png" width="100%" style="display: block; margin: auto auto auto 0;" /> ] -- .pull-right.code80[ ```r (repos <- tibble(repo = repurrrsive::gh_repos)) ``` ``` #> # A tibble: 6 × 1 #> repo #> <list> #> 1 <list [30]> #> 2 <list [30]> #> 3 <list [30]> #> 4 <list [26]> #> 5 <list [30]> #> # … with 1 more row ``` ```r (repos <- repos %>% unnest_longer(repo)) ``` ``` #> # A tibble: 176 × 1 #> repo #> <list> #> 1 <named list [68]> #> 2 <named list [68]> #> 3 <named list [68]> #> 4 <named list [68]> #> 5 <named list [68]> #> # … with 171 more rows ``` ] --- .pull-left.code90[ ```r repos %>% unnest_wider(repo) ``` ``` #> # A tibble: 176 × 68 #> id name full_…¹ owner #> <int> <chr> <chr> <list> #> 1 61160198 after gaborc… <named list> #> 2 40500181 argufy gaborc… <named list> #> 3 36442442 ask gaborc… <named list> #> 4 34924886 baseimp… gaborc… <named list> #> 5 61620661 citest gaborc… <named list> #> # … with 171 more rows, 64 more #> # variables: private <lgl>, #> # html_url <chr>, description <chr>, #> # fork <lgl>, url <chr>, #> # forks_url <chr>, keys_url <chr>, #> # collaborators_url <chr>, #> # teams_url <chr>, hooks_url <chr>, … ``` ] -- .pull-right.code90[ ```r *repos %>% hoist( repo, * login = c("owner", "login"), name = "name", homepage = "homepage", watchers = "watchers_count" ) ``` ``` #> # A tibble: 176 × 5 *#> login name homep…¹ watch…² *#> <chr> <chr> <chr> <int> #> 1 gaborcsardi after <NA> 5 #> 2 gaborcsardi argufy <NA> 19 #> 3 gaborcsardi ask <NA> 5 #> 4 gaborcsardi baseimpor… <NA> 0 #> 5 gaborcsardi citest <NA> 0 #> # … with 171 more rows, 1 more #> # variable: repo <list>, and #> # abbreviated variable names #> # ¹homepage, ²watchers ``` <br> .font120.bold[👉 `vignette("rectangle")`] ] --- layout: false class: inverse, center, middle # 课后作业 --- <br> .font130[ 1\. 复习 📖 [_R for Data Science_](https://r4ds.had.co.nz/) 一书的第10章、第11章和第12章的内容 2\. 下载(打印) 📰 .bold[[{{Data Import的cheatsheet}}](https://posit.co/wp-content/uploads/2022/10/data-import.pdf)] 和 .bold[[{{Data Tidying的cheatsheet}}](https://posit.co/wp-content/uploads/2022/10/tidyr.pdf)] 并阅读之 3\. `browseVignettes(package = c("readr", "tidyr"))` 📝 > .code80[ ``` Introduction to readr Tidy data Pivoting ``` ] 4\. 完成第六讲的课后练习 👩💻 > .code90[ ```r remotes::install_github("qfwr2022/qfwr") # 记得选择允许安装qfwr包所依赖的learnr、testwhat等包 library(qfwr) qfwr_ex("L06") ``` ] ] --- class: center, middle, hide_logo background-image: url(imgs/xaringan.png) background-size: 12% background-position: 50% 40% <br><br><br><br><br><br><br> <hr color='#f00' size='2px' width='80%'> <br> .Large.red[_**本网页版讲义的制作由 R 包 [{{`xaringan`}}](https://github.com/yihui/xaringan) 赋能!**_]