How could the channel be so clear? Because fresh water flows in from the source. —— Zhu Xi (Song dynasty)

Like families, tidy datasets are all alike but every messy dataset is messy in its own way. —— Hadley Wickham
1. Data Import: the readr package (v2.1.3)
2. Data Tidying: the tidyr package (v1.2.1)
   pivot_longer() and pivot_wider()
   separate() and unite()
   unnest_*() and hoist()
1. Data Import

The readr package
☑ Consistent function naming

read_csv() | read_csv2() | read_tsv() | read_delim() | read_table() | read_fwf() | read_lines() | read_lines_raw() | read_file() | read_file_raw() | read_log()
☑ Largely consistent arguments

read_*(
  file, # delim = NULL, escape_backslash = FALSE, escape_double = TRUE,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  id = NULL,
  locale = default_locale(),
  na = c("", "NA"),
  quoted_na = TRUE,
  quote = "\"",
  comment = "",
  trim_ws = TRUE,
  skip = 0,
  n_max = Inf,
  guess_max = min(1000, n_max),
  name_repair = "unique",
  num_threads = readr_threads(),
  progress = show_progress(),
  show_col_types = should_show_types(),
  skip_empty_rows = TRUE,
  lazy = should_read_lazy()
)
# file: Either a path to a file, a connection, or literal data.
# Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed.
# Files starting with http://, https://, ftp://, or ftps:// will be downloaded.
# Using a value of clipboard() will read from the system clipboard.
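As a quick illustration of that flexible file argument, here is a minimal sketch; readr_example() ships with readr and returns the path to one of its bundled sample files.

# A minimal sketch of the `file` argument's flexibility
library(readr)
read_csv(readr_example("mtcars.csv"))      # a path to a file on disk
read_csv(readr_example("mtcars.csv.zip"))  # a compressed file, uncompressed on the fly
read_csv("x,y\n1,2")                       # literal data passed as a string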
The readr package
☑ Compared with read.*() from the built-in utils package

Returns a tibble* (more precisely, a spec_tbl_df)
Does less by default: e.g. it does not convert character columns to factors (read.*() has a stringsAsFactors argument), does not change column names, and does not turn a data column into row names
Depends less on the system and on environment variables, so results are more reproducible
Usually reads much faster (~50x), with a progress bar for large files
?`tbl_df-class`
tibble: a subclass of data.frame and the central data structure for the tidyverse; lazy and surly, it does less and complains more than base data.frames.
1. Columns are never type-converted automatically (e.g. character -> factor), and list-columns are supported natively
2. Only length-1 ("scalar") vectors are recycled
3. Non-syntactic variable names are allowed, e.g. tibble(`:)` = "smile")
4. Row names are never added, and storing information in row names is discouraged
5. Subsetting a tibble with [ always returns a new tibble; [[ and $ always return a vector, and $ does not partially match variable names
6. Printing a tibble adapts to the screen size and shows more useful information (similar to str())
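A small sketch contrasting points 3 and 5 above with base data.frame behaviour (the toy column names are illustrative only):

library(tibble)
df <- data.frame(xy = c("a", "b"))
tb <- tibble(xy = c("a", "b"))
df$x                    # $ partially matches and silently returns column xy
tb$x                    # warns "Unknown or uninitialised column" and returns NULL
df[, "xy"]              # [ drops down to a character vector
tb[, "xy"]              # [ returns a 2 × 1 tibble
tibble(`:)` = "smile")  # non-syntactic names are allowed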
The readr package — common arguments, illustrated with read_csv()
library(tidyverse)
read_csv("The 1st line of metadata
  The 2nd line of metadata
  # A comment to skip
  x,y,z
  1,2,3", skip = 2, comment = "#")
#> Rows: 1 Columns: 3
#> ── Column specification ────────────────
#> Delimiter: ","
#> dbl (3): x, y, z
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
read_csv("1,2,3\n4,5,6")
#> # A tibble: 1 × 3
#>     `1`   `2`   `3`
#>   <dbl> <dbl> <dbl>
#> 1     4     5     6
read_csv("1,2,3\n4,5,6", col_names = FALSE)
#> # A tibble: 2 × 3
#>      X1    X2    X3
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
#> # A tibble: 2 × 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6
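For comparison, read_csv2() from the naming list above targets locales that use the semicolon as the field separator and the comma as the decimal mark; a minimal sketch:

# read_csv2(): semicolon-delimited fields, comma as decimal mark
read_csv2("x;y\n1,5;2,3") # parses to x = 1.5, y = 2.3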
The readr package
readr uses a heuristic strategy to parse the data in a text file
# Generate an example data file
set.seed(123456) # set the random seed
nycflights13::flights %>%
  mutate(dep_delay = if_else(dep_delay <= 0, FALSE, TRUE)) %>%
  select(month:dep_time, dep_delay, tailnum, time_hour) %>%
  slice_sample(n = 20) %>%
  mutate(across(everything(), ~ ifelse(runif(20) <= 0.1, NA, .))) %>%
  mutate(time_hour = lubridate::as_datetime(time_hour)) %>%
  print(n = 5) %>%
  write_csv(file = "data/ex_flights.csv", na = "--") # the default is na = "NA"
#> # A tibble: 20 × 6
#>   month   day dep_time dep_delay tailnum time_hour
#>   <int> <int>    <int> <lgl>     <chr>   <dttm>
#> 1     6    26      932 TRUE      <NA>    2013-06-26 12:00:00
#> 2    NA     5       NA NA        <NA>    2013-12-05 17:00:00
#> 3     7    20      656 FALSE     N17245  2013-07-20 10:00:00
#> 4     5    16       NA FALSE     N27152  2013-05-16 12:00:00
#> 5    12    23       NA NA        <NA>    2013-12-23 23:00:00
#> # … with 15 more rows
(ex_flights <- read_csv("data/ex_flights.csv")) # the default is na = c("", "NA")
#> Rows: 20 Columns: 6
#> ── Column specification ──────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): month, day, dep_time, dep_delay, tailnum, time_hour
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 20 × 6
#>   month day   dep_time dep_delay tailnum time_hour
#>   <chr> <chr> <chr>    <chr>     <chr>   <chr>
#> 1 6     26    932      TRUE      --      2013-06-26T12:00:00Z
#> 2 --    5     --       --        --      2013-12-05T17:00:00Z
#> 3 7     20    656      FALSE     N17245  2013-07-20T10:00:00Z
#> 4 5     16    --       FALSE     N27152  2013-05-16T12:00:00Z
#> 5 12    23    --       --        --      2013-12-23T23:00:00Z
#> # … with 15 more rows
1. Split the text file into a matrix of strings, using the given or guessed delimiter
2. Determine the type of each column of strings
   2.1 specified directly via the col_types argument
   2.2 guessed: read the first guess_max rows of the file (v2 vs. v1) and try each column against the types in this order: logical -> integer -> double -> number -> time -> date -> date-time -> character
3. Parse each column of strings into a vector of the corresponding type
When readr's automatic parsing runs into trouble:
use problems() to inspect the issues encountered while reading
set the relevant read_*() arguments as needed, e.g. na = "--"
set the read_*() argument col_types = cols(...) directly
read everything in as character columns, then convert the types with type_convert()
or handle it in a later tidying step (e.g. dplyr::mutate() + parse_*(), sketched below)
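A minimal sketch of that last strategy, assuming the same data/ex_flights.csv file as above; each parse_*() function accepts its own na argument:

# mutate() + parse_*(): read everything as character, then fix columns one by one
read_csv("data/ex_flights.csv",
         col_types = cols(.default = col_character())) %>%
  mutate(
    month     = parse_integer(month, na = "--"),     # character -> integer
    dep_delay = parse_logical(dep_delay, na = "--"), # character -> logical
    time_hour = parse_datetime(time_hour)            # ISO 8601 -> date-time
  )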
# Set the missing-value argument
read_csv("data/ex_flights.csv", na = "--") %>%
  spec()
#> cols(
#>   month = col_double(),
#>   day = col_double(),
#>   dep_time = col_double(),
#>   dep_delay = col_logical(),
#>   tailnum = col_character(),
#>   time_hour = col_datetime(format = "")
#> )

# Set the col_types argument directly; see ?cols
ex_flights <- read_csv(
  "data/ex_flights.csv",
  col_types = cols( # or cols_only()
    dep_delay = col_logical(),  # "l"
    tailnum = col_character(),  # "c"
    time_hour = col_datetime(), # "T"
    .default = col_integer()
  )
  # equivalently: col_types = "iiilcT"
)
#> Warning: One or more parsing issues, call `problems()` on
#> your data frame for details, e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
problems(ex_flights)
#> # A tibble: 17 × 5
#>     row   col expected           actual file
#>   <int> <int> <chr>              <chr>  <chr>
#> 1     3     1 an integer         --     C:/Users/admin/Desktop/QFwR_2022.09/slides/L06_Im…
#> 2     3     3 an integer         --     C:/Users/admin/Desktop/QFwR_2022.09/slides/L06_Im…
#> 3     3     4 1/0/T/F/TRUE/FALSE --     C:/Users/admin/Desktop/QFwR_2022.09/slides/L06_Im…
#> # … with 14 more rows
# Read everything in, then convert the types
read_csv(
  "data/ex_flights.csv",
  col_types = cols(
    .default = col_character()
  )
) %>%
  type_convert(col_types = "iiilcT")
#> # A tibble: 20 × 6
#>   month   day dep_time dep_delay tailnum time_hour
#>   <int> <int>    <int> <lgl>     <chr>   <dttm>
#> 1     6    26      932 TRUE      --      2013-06-26 12:00:00
#> 2    NA     5       NA NA        --      2013-12-05 17:00:00
#> 3     7    20      656 FALSE     N17245  2013-07-20 10:00:00
#> # … with 17 more rows
The readr package
Read multiple files of the same structure in one call

# Generate example files with identical structure
nycflights13::flights %>%
  split(.$carrier) %>%
  purrr::iwalk(~ write_tsv(.x, glue::glue("data/flights_{.y}.tsv")))
# A vector of file paths
(files <- dir(path = "data/", pattern = "\\.tsv$", full.names = TRUE))
#>  [1] "data/flights_9E.tsv" "data/flights_AA.tsv" "data/flights_AS.tsv"
#>  [4] "data/flights_B6.tsv" "data/flights_DL.tsv" "data/flights_EV.tsv"
#>  [7] "data/flights_F9.tsv" "data/flights_FL.tsv" "data/flights_HA.tsv"
#> [10] "data/flights_MQ.tsv" "data/flights_OO.tsv" "data/flights_UA.tsv"
#> [13] "data/flights_US.tsv" "data/flights_VX.tsv" "data/flights_WN.tsv"
#> [16] "data/flights_YV.tsv"

# Read them all at once (and stack them into one tibble)
read_tsv(files, id = "fpath") # the id argument stores each file's path in the variable fpath
#> # A tibble: 336,776 × 20
#>   fpath    year month   day dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶ carrier flight
#>   <chr>   <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>    <dbl>
#> 1 data/f…  2013     1     1     810     810       0    1048    1037      11 9E        3538
#> 2 data/f…  2013     1     1    1451    1500      -9    1634    1636      -2 9E        4105
#> 3 data/f…  2013     1     1    1452    1455      -3    1637    1639      -2 9E        3295
#> # … with 336,773 more rows, 8 more variables: tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and
#> #   abbreviated variable names ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time,
#> #   ⁵sched_arr_time, ⁶arr_delay
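For what it's worth, a roughly equivalent purrr sketch (the usual route before readr 2.0 accepted a vector of paths):

files %>%
  purrr::set_names() %>%                  # name each path by itself
  purrr::map_dfr(read_tsv, .id = "fpath") # .id stores those names in fpath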
The readr package
A graphical user interface for readr *

* In RStudio: Environment pane (top right) > Import Dataset > From Text (readr)...
readr::read_rds(): R data files (.rds)
{{vroom}}: Read and Write Rectangular Text Data Quickly
{{readxl}}: excel files (.xls and .xlsx) <- take a 🧐 at it!
{{haven}}: SPSS, Stata, and SAS files
{{arrow}}: Apache Arrow
2. Data Tidying (Tidy Data)

Three conditions for tidy data 📐
Every column is a variable.
Every row is an observation.
Every cell is a single value.
Benefits of tidy data 💪
☑ Tidy data stores data in one logically consistent way, which makes it easier to learn and master the tools that work with it
☑ "Every column is a variable" and "every row is an observation" let you exploit R's vectorized operations
☑ The tidyverse packages (such as dplyr) are designed to take tidy data as input

But... not every dataset is tidy 😭
table1
#> # A tibble: 6 × 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> # … with 3 more rows
table2
#> # A tibble: 12 × 4
#>   country      year type          count
#>   <chr>       <int> <chr>         <int>
#> 1 Afghanistan  1999 cases           745
#> 2 Afghanistan  1999 population 19987071
#> 3 Afghanistan  2000 cases          2666
#> # … with 9 more rows
table3
#> # A tibble: 6 × 3
#>   country      year rate
#> * <chr>       <int> <chr>
#> 1 Afghanistan  1999 745/19987071
#> 2 Afghanistan  2000 2666/20595360
#> 3 Brazil       1999 37737/172006362
#> # … with 3 more rows
table4a # cases
#> # A tibble: 3 × 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766
table4b # population
#> # A tibble: 3 × 3
#>   country         `1999`     `2000`
#> * <chr>            <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583
Enter the tidyr package 🎉
The functions in tidyr fall roughly into five groups:

1. Pivoting between long and wide form: pivot_longer() and pivot_wider()
2. Splitting or combining character columns: separate(), separate_rows(), extract() and unite()
3. Rectangling deeply nested lists into tibbles: unnest_longer(), unnest_wider(), unnest_auto() and hoist()
4. Converting a grouped data frame into a nested one (one row per group) and back: nest() and unnest()
5. Handling missing values: ->
complete(data, ..., fill = list())
: Complete a data frame with missing combinations of data
expand(data, ...)
, crossing(...)
, nesting(...)
: Expand data frame to include all combinations of values
expand_grid(...)
: Create a tibble from all combinations of the inputted name-value pairs
full_seq(x, period, tol = 1e-06)
: Create the full sequence of values in a vector
drop_na(data, ...)
: Drop rows containing missing values
fill(data, ..., .direction = c("down", "up", "downup", "updown"))
: Fill in missing values with previous or next value
replace_na(data, replace, ...)
: Replace missing values with specified values
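A small sketch of three of these helpers on made-up data (the tibble below is illustrative only, assuming the tidyverse is loaded):

df <- tibble(
  year  = c(2020, 2020, 2021),
  half  = c(1, 2, 2),
  sales = c(10, NA, 30)
)
df %>% complete(year, half)             # adds the missing year = 2021, half = 1 row
df %>% fill(sales, .direction = "down") # carries 10 forward into the NA
df %>% replace_na(list(sales = 0))      # replaces the NA in sales with 0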
pivot_longer() and pivot_wider()

pivot_longer(
  data,
  cols,
  names_to = "name",
  names_prefix = NULL,
  names_sep = NULL,
  names_pattern = NULL,
  names_ptypes = NULL,
  names_transform = NULL,
  names_repair = "check_unique",
  values_to = "value",
  values_drop_na = FALSE,
  values_ptypes = NULL,
  values_transform = NULL,
  ...
)
table4a %>%
  pivot_longer(
    cols = c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases"
  )
# table4b %>% ...
#> # A tibble: 6 × 3
#>   country     year  cases
#>   <chr>       <chr> <int>
#> 1 Afghanistan 1999    745
#> 2 Afghanistan 2000   2666
#> 3 Brazil      1999  37737
#> # … with 3 more rows
pivot_wider(
  data,
  id_cols = NULL,
  id_expand = FALSE,
  names_from = name,
  names_prefix = "",
  names_sep = "_",
  names_glue = NULL,
  names_sort = FALSE,
  names_vary = "fastest",
  names_expand = FALSE,
  names_repair = "check_unique",
  values_from = value,
  values_fill = NULL,
  values_fn = NULL,
  unused_fn = NULL,
  ...
)
table2 %>%
  pivot_wider(
    names_from = type,
    values_from = count
  )
#> # A tibble: 6 × 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> # … with 3 more rows
separate(), extract() and unite()

separate(
  data,
  col,
  into,
  sep = "[^[:alnum:]]+",
  remove = TRUE,
  convert = FALSE,
  extra = "warn",
  fill = "warn",
  ...
)
extract(
  data,
  col,
  into,
  regex = "([[:alnum:]]+)",
  remove = TRUE,
  convert = FALSE,
  ...
)
separate_rows(data, ..., sep, convert)
table3 %>%
  separate(
    rate,
    into = c("cases", "population"),
    convert = TRUE
  )
#> # A tibble: 6 × 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> # … with 3 more rows
unite(
  data,
  col,
  ...,
  sep = "_",
  remove = TRUE,
  na.rm = FALSE
)
table5 %>%
  unite(
    col = "year", century, year,
    sep = ""
  )
#> # A tibble: 6 × 3
#>   country     year  rate
#>   <chr>       <chr> <chr>
#> 1 Afghanistan 1999  745/19987071
#> 2 Afghanistan 2000  2666/20595360
#> 3 Brazil      1999  37737/172006362
#> # … with 3 more rows
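separate_rows() shows up in the signatures above but not in the examples; a minimal sketch on made-up data:

# separate_rows() splits a delimited cell across rows
tibble(x = 1:2, y = c("a,b", "c")) %>%
  separate_rows(y, sep = ",") # three rows: (1, "a"), (1, "b"), (2, "c")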
who # ?who
#> # A tibble: 7,240 × 60
#>    country     iso2  iso3   year new_sp_…¹ new_s…² new_s…³ new_s…⁴ new_s…⁵ new_s…⁶ new_s…⁷
#>    <chr>       <chr> <chr> <int>     <int>   <int>   <int>   <int>   <int>   <int>   <int>
#>  1 Afghanistan AF    AFG    1980        NA      NA      NA      NA      NA      NA      NA
#>  2 Afghanistan AF    AFG    1981        NA      NA      NA      NA      NA      NA      NA
#>  3 Afghanistan AF    AFG    1982        NA      NA      NA      NA      NA      NA      NA
#>  4 Afghanistan AF    AFG    1983        NA      NA      NA      NA      NA      NA      NA
#>  5 Afghanistan AF    AFG    1984        NA      NA      NA      NA      NA      NA      NA
#>  6 Afghanistan AF    AFG    1985        NA      NA      NA      NA      NA      NA      NA
#>  7 Afghanistan AF    AFG    1986        NA      NA      NA      NA      NA      NA      NA
#>  8 Afghanistan AF    AFG    1987        NA      NA      NA      NA      NA      NA      NA
#>  9 Afghanistan AF    AFG    1988        NA      NA      NA      NA      NA      NA      NA
#> 10 Afghanistan AF    AFG    1989        NA      NA      NA      NA      NA      NA      NA
#> # … with 7,230 more rows, 49 more variables: new_sp_f014 <int>, new_sp_f1524 <int>,
#> #   new_sp_f2534 <int>, new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
#> #   new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>,
#> #   new_sn_m3544 <int>, new_sn_m4554 <int>, new_sn_m5564 <int>, new_sn_m65 <int>,
#> #   new_sn_f014 <int>, new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
#> #   new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>, new_ep_m014 <int>,
#> #   new_ep_m1524 <int>, new_ep_m2534 <int>, new_ep_m3544 <int>, new_ep_m4554 <int>, …
who1 <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = "name",
    values_to = "cases",
    values_drop_na = TRUE
  )
who1
#> # A tibble: 76,046 × 6
#>   country     iso2  iso3   year name         cases
#>   <chr>       <chr> <chr> <int> <chr>        <int>
#> 1 Afghanistan AF    AFG    1997 new_sp_m014      0
#> 2 Afghanistan AF    AFG    1997 new_sp_m1524    10
#> 3 Afghanistan AF    AFG    1997 new_sp_m2534     6
#> # … with 76,043 more rows
What do the strings in the variable name stand for? The data dictionary for who tells us:

- the first 3 letters say whether the column counts "new" TB cases; in this data they are all new
- the next 2-3 letters give the type of TB (sp, sn, ep, rel, ...)
- the next letter, m or f, gives the patients' sex (male or female)
- the remaining digits give the age group, e.g. 014 = 0-14 years | 1524 = 15-24 years
who2 <- who1 %>%
  mutate(name = stringr::str_replace(name, "newrel", "new_rel")) %>%
  separate(name, c(NA, "type", "sexage"), sep = "_") # split on the separator
who2
#> # A tibble: 76,046 × 7
#>   country     iso2  iso3   year type  sexage cases
#>   <chr>       <chr> <chr> <int> <chr> <chr>  <int>
#> 1 Afghanistan AF    AFG    1997 sp    m014       0
#> 2 Afghanistan AF    AFG    1997 sp    m1524     10
#> 3 Afghanistan AF    AFG    1997 sp    m2534      6
#> # … with 76,043 more rows
who3 <- who2 %>%
  separate(sexage, c("sex", "age"), sep = 1) # split by position
who3
#> # A tibble: 76,046 × 8
#>   country     iso2  iso3   year type  sex   age   cases
#>   <chr>       <chr> <chr> <int> <chr> <chr> <chr> <int>
#> 1 Afghanistan AF    AFG    1997 sp    m     014       0
#> 2 Afghanistan AF    AFG    1997 sp    m     1524     10
#> 3 Afghanistan AF    AFG    1997 sp    m     2534      6
#> # … with 76,043 more rows
# all in a pipe!
who %>%
  pivot_longer(cols = new_sp_m014:newrel_f65,
               names_to = "name", values_to = "cases",
               values_drop_na = TRUE) %>%
  mutate(name = stringr::str_replace(name, "newrel", "new_rel")) %>%
  separate(name, c(NA, "type", "sexage")) %>%
  separate(sexage, c("sex", "age"), sep = 1)
# all with the powerful pivot_longer()!
who %>%
  pivot_longer(
    cols = starts_with("new"),
    names_to = c("type", "sex", "age"),
    names_pattern = "new_?(.*)_(.)(.*)",
    names_ptypes = list(
      sex = factor(levels = c("f", "m")),
      age = factor(
        levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
        ordered = TRUE)),
    values_to = "cases",
    values_drop_na = TRUE
  )
#> # A tibble: 76,046 × 8
#>   country  iso2  iso3   year type  sex
#>   <chr>    <chr> <chr> <int> <chr> <fct>
#> 1 Afghani… AF    AFG    1997 sp    m
#> 2 Afghani… AF    AFG    1997 sp    m
#> 3 Afghani… AF    AFG    1997 sp    m
#> 4 Afghani… AF    AFG    1997 sp    m
#> 5 Afghani… AF    AFG    1997 sp    m
#> # … with 76,041 more rows, and 2 more
#> #   variables: age <ord>, cases <int>
👉 vignette("pivot")
Exercise: fill in the blanks to convert the billboard dataset into the target structure:
library(tidyverse)
billboard %>% View()
billboard %>%
pivot_ (
cols = ,
names_to = ,
,
names_transform = ,
values_to = ,
values_drop_na = TRUE
)
# A tibble: 317 × 79
  artist    track date.ent…¹   wk1   wk2
  <chr>     <chr> <date>     <dbl> <dbl>
1 2 Pac     Baby… 2000-02-26    87    82
2 2Ge+her   The … 2000-09-02    91    87
3 3 Doors … Kryp… 2000-04-08    81    70
# … with 314 more rows, 74 more
#   variables: wk3 <dbl>, wk4 <dbl>,
#   wk5 <dbl>, wk6 <dbl>, wk7 <dbl>,
#   wk8 <dbl>, wk9 <dbl>, wk10 <dbl>,
#   wk11 <dbl>, wk12 <dbl>, wk13 <dbl>,
#   wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
#   wk17 <dbl>, wk18 <dbl>, …

# A tibble: 5,307 × 5
  artist track    date.ent…¹  week  rank
  <chr>  <chr>    <date>     <int> <dbl>
1 2 Pac  Baby Do… 2000-02-26     1    87
2 2 Pac  Baby Do… 2000-02-26     2    82
3 2 Pac  Baby Do… 2000-02-26     3    72
# … with 5,304 more rows, and
#   abbreviated variable name
#   ¹date.entered
library(tidyverse)
billboard %>% View()
billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    names_transform = list(week = as.integer),
    values_to = "rank",
    values_drop_na = TRUE
  )
unnest_*() and hoist()

repurrrsive::gh_repos %>% View()
(repos <- tibble(repo = repurrrsive::gh_repos))
#> # A tibble: 6 × 1
#>   repo
#>   <list>
#> 1 <list [30]>
#> 2 <list [30]>
#> 3 <list [30]>
#> 4 <list [26]>
#> 5 <list [30]>
#> # … with 1 more row
(repos <- repos %>% unnest_longer(repo))
#> # A tibble: 176 × 1
#>   repo
#>   <list>
#> 1 <named list [68]>
#> 2 <named list [68]>
#> 3 <named list [68]>
#> 4 <named list [68]>
#> 5 <named list [68]>
#> # … with 171 more rows
repos %>% unnest_wider(repo)
#> # A tibble: 176 × 68
#>         id name     full_…¹ owner
#>      <int> <chr>    <chr>   <list>
#> 1 61160198 after    gaborc… <named list>
#> 2 40500181 argufy   gaborc… <named list>
#> 3 36442442 ask      gaborc… <named list>
#> 4 34924886 baseimp… gaborc… <named list>
#> 5 61620661 citest   gaborc… <named list>
#> # … with 171 more rows, 64 more
#> #   variables: private <lgl>,
#> #   html_url <chr>, description <chr>,
#> #   fork <lgl>, url <chr>,
#> #   forks_url <chr>, keys_url <chr>,
#> #   collaborators_url <chr>,
#> #   teams_url <chr>, hooks_url <chr>, …
repos %>%
  hoist(
    repo,
    login = c("owner", "login"),
    name = "name",
    homepage = "homepage",
    watchers = "watchers_count"
  )
#> # A tibble: 176 × 5
#>   login       name       homep…¹ watch…²
#>   <chr>       <chr>      <chr>     <int>
#> 1 gaborcsardi after      <NA>          5
#> 2 gaborcsardi argufy     <NA>         19
#> 3 gaborcsardi ask        <NA>          5
#> 4 gaborcsardi baseimpor… <NA>          0
#> 5 gaborcsardi citest     <NA>          0
#> # … with 171 more rows, 1 more
#> #   variable: repo <list>, and
#> #   abbreviated variable names
#> #   ¹homepage, ²watchers
👉 vignette("rectangle")
1. Review 📖 chapters 10, 11 and 12 of R for Data Science
2. Download (and print) 📰 the {{Data Import cheatsheet}} and the {{Data Tidying cheatsheet}}, and read them
3. browseVignettes(package = c("readr", "tidyr")) 📝 Introduction to readr | Tidy data | Pivoting
4. Finish the exercises for Lecture 6 👩‍💻

remotes::install_github("qfwr2022/qfwr")
# remember to allow installation of the packages qfwr depends on (learnr, testwhat, etc.)
library(qfwr)
qfwr_ex("L06")
How could the channel be so clear? Because fresh water flows in from the source. —— Zhu Xi (Song dynasty)