
Quantitative Finance and Financial Programming

L06 Data Import and Tidying


Zeng Yongyi

School of Management, Xiamen University


2022-11-11

1 / 32

How can the channel run so clear? Because fresh water flows in from the source. — Zhu Xi (Song dynasty)

Like families, tidy datasets are all alike but every messy
dataset is messy in its own way. — Hadley Wickham


1. Data Import

  • Reading rectangular text data: readr 2.1.3
  • Reading other types of data

2. Data Tidying

  • Tidy data
  • tidyr 1.2.1
  • pivot_longer() / pivot_wider()
  • separate() / unite()
  • A complex case
  • unnest_*() / hoist()
2 / 32

1. Data Import

3 / 32

>> Reading text files: readr

☑ Consistent function naming

read_csv() | read_csv2() | read_tsv() | read_delim() |
read_table() | read_fwf() |
read_lines() | read_lines_raw() | read_file() | read_file_raw() | read_log() |

☑ Largely consistent arguments

read_*(file, # delim = NULL, escape_backslash = FALSE, escape_double = TRUE,
       col_names = TRUE, col_types = NULL, col_select = NULL,
       id = NULL, locale = default_locale(),
       na = c("", "NA"), quoted_na = TRUE, quote = "\"", comment = "", trim_ws = TRUE,
       skip = 0, n_max = Inf, guess_max = min(1000, n_max),
       name_repair = "unique", num_threads = readr_threads(),
       progress = show_progress(), show_col_types = should_show_types(),
       skip_empty_rows = TRUE, lazy = should_read_lazy())
# file: Either a path to a file, a connection, or literal data.
# Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed.
# Files starting with http://, https://, ftp://, or ftps:// will be downloaded.
# Using a value of clipboard() will read from the system clipboard.
4 / 32
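The `locale` argument above controls region-dependent parsing (decimal marks, date and month names, encodings). A minimal sketch, not from the slides, of reading European-style numbers; the data string is made up:

```r
library(readr)

# ";"-delimited data using "," as the decimal mark
read_csv2("x;y\n1,5;2,5")  # read_csv2() assumes exactly this convention
# The same, spelled out with an explicit locale:
read_delim("x;y\n1,5;2,5", delim = ";",
           locale = locale(decimal_mark = ","))
# Both return a 1-row tibble with x = 1.5 and y = 2.5
```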

>> Reading text files: readr

☑ Compared with read.*() from the built-in utils package

  • Returns a tibble* (more precisely, a spec_tbl_df)

  • Does less by default: it does not convert character columns to factors (read.*() has a stringsAsFactors argument), does not mangle column names, and does not turn a data column into row names

  • Depends less on the system and on environment variables, so results are more reproducible

  • Usually reads much faster (~50x), with a progress bar for large files

?`tbl_df-class`
tibble: a subclass of data.frame and the central data structure for the tidyverse
lazy and surly - do less and complain more than base data.frames.
======================================================================================
1. No automatic column type coercion (e.g. character -> factor), and list-columns are supported natively
2. Only length-1 ("scalar") vectors are recycled
3. Non-syntactic variable names are allowed, e.g. tibble(`:)` = "smile")
4. No row names are added, and storing information in row names is discouraged
5. Subsetting a tibble with [ always returns a tibble; [[ and $ always return a vector, and $ does not partial-match names
6. Printing adapts to the screen width and shows more useful information (similar to str())
5 / 32
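Point 5 in the notes above is worth seeing once; a minimal sketch (toy data) of how tibbles differ from data.frames under subsetting:

```r
library(tibble)

df <- data.frame(value = 1:3)
tb <- tibble(value = 1:3)

df$val        # data.frame partial-matches the name: 1 2 3
tb$val        # NULL (with a warning): tibbles never partial-match
df[, "value"] # a data.frame drops to an integer vector here
tb[, "value"] # a tibble stays a 1-column tibble
```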

>> Reading text files: readr

Common arguments in action: read_csv()

library(tidyverse)
read_csv("The 1st line of metadata
The 2nd line of metadata
# A comment to skip
x,y,z
1,2,3",
  skip = 2,
  comment = "#")
#> Rows: 1 Columns: 3
#> ── Column specification ────────────────
#> Delimiter: ","
#> dbl (3): x, y, z
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 3
#> x y z
#> <dbl> <dbl> <dbl>
#> 1 1 2 3
read_csv("1,2,3\n4,5,6")
#> # A tibble: 1 × 3
#> `1` `2` `3`
#> <dbl> <dbl> <dbl>
#> 1 4 5 6
read_csv("1,2,3\n4,5,6",
  col_names = FALSE)
#> # A tibble: 2 × 3
#> X1 X2 X3
#> <dbl> <dbl> <dbl>
#> 1 1 2 3
#> 2 4 5 6
read_csv("1,2,3\n4,5,6",
  col_names = c("x", "y", "z"))
#> # A tibble: 2 × 3
#> x y z
#> <dbl> <dbl> <dbl>
#> 1 1 2 3
#> 2 4 5 6
6 / 32

>> Reading text files: readr

readr parses text files using a heuristic strategy

# Generate an example data file
set.seed(123456) # set the random seed
nycflights13::flights %>%
  mutate(dep_delay = if_else(dep_delay <= 0, FALSE, TRUE)) %>%
  select(month:dep_time, dep_delay, tailnum, time_hour) %>%
  slice_sample(n = 20) %>%
  mutate(across(everything(), ~ifelse(runif(20) <= 0.1, NA, .))) %>%
  mutate(time_hour = lubridate::as_datetime(time_hour)) %>%
  print(n = 5) %>%
  write_csv(file = "data/ex_flights.csv", na = "--") # default: na = "NA"
#> # A tibble: 20 × 6
#> month day dep_time dep_delay tailnum time_hour
#> <int> <int> <int> <lgl> <chr> <dttm>
#> 1 6 26 932 TRUE <NA> 2013-06-26 12:00:00
#> 2 NA 5 NA NA <NA> 2013-12-05 17:00:00
#> 3 7 20 656 FALSE N17245 2013-07-20 10:00:00
#> 4 5 16 NA FALSE N27152 2013-05-16 12:00:00
#> 5 12 23 NA NA <NA> 2013-12-23 23:00:00
#> # … with 15 more rows
7 / 32

>> Reading text files: readr

readr parses text files using a heuristic strategy

(ex_flights <- read_csv("data/ex_flights.csv")) # default: na = c("", "NA")
#> Rows: 20 Columns: 6
#> ── Column specification ──────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): month, day, dep_time, dep_delay, tailnum, time_hour
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 20 × 6
#> month day dep_time dep_delay tailnum time_hour
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 6 26 932 TRUE -- 2013-06-26T12:00:00Z
#> 2 -- 5 -- -- -- 2013-12-05T17:00:00Z
#> 3 7 20 656 FALSE N17245 2013-07-20T10:00:00Z
#> 4 5 16 -- FALSE N27152 2013-05-16T12:00:00Z
#> 5 12 23 -- -- -- 2013-12-23T23:00:00Z
#> # … with 15 more rows
8 / 32

>> Reading text files: readr

readr parses text files using a heuristic strategy

1. Split the file into a matrix of strings using the given or guessed delimiter
2. Determine the type of each column of strings
      2.1 Given directly via the col_types argument
      2.2 Guessed: read guess_max rows of the file (v2 vs. v1), then try each column against the types in this order: logical -> integer -> double -> number -> time -> date -> date-time -> character
3. Parse each column of strings into a vector of the corresponding type

When readr's automatic parsing runs into problems

  • Use problems() to inspect the issues encountered while reading

  • Set the relevant read_*() arguments as needed, e.g. na = "--"

  • Specify the read_*() argument col_types = cols(...) directly

  • Read all columns in as character, then convert the types with type_convert()

  • Handle it in later data-wrangling steps (e.g. dplyr::mutate() + parse_*())

9 / 32
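The parse_*() helpers mentioned in the last bullet also work directly on character vectors; a minimal sketch with made-up values:

```r
library(readr)

parse_number("¥1,234.50")  # drops the prefix and grouping marks: 1234.5
parse_double("1,5", locale = locale(decimal_mark = ","))  # 1.5
parse_date("26/06/2013", format = "%d/%m/%Y")
parse_logical(c("TRUE", "F", "--"), na = "--")  # TRUE FALSE NA
```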

>> Reading text files: readr

# Set the missing-value argument
read_csv("data/ex_flights.csv",
  na = "--") %>% spec()
#> cols(
#> month = col_double(),
#> day = col_double(),
#> dep_time = col_double(),
#> dep_delay = col_logical(),
#> tailnum = col_character(),
#> time_hour = col_datetime(format = "")
#> )
# Set the col_types argument directly, see ?cols
ex_flights <- read_csv(
  "data/ex_flights.csv",
  col_types = cols( # cols_only()
    dep_delay = col_logical(),  # "l"
    tailnum = col_character(),  # "c"
    time_hour = col_datetime(), # "T"
    .default = col_integer()
  ) # col_types = "iiilcT"
)
#> Warning: One or more parsing issues, call `problems()` on
#> your data frame for details, e.g.:
#> dat <- vroom(...)
#> problems(dat)
problems(ex_flights)
#> # A tibble: 17 × 5
#> row col expected actual file
#> <int> <int> <chr> <chr> <chr>
#> 1 3 1 an integer -- C:/Users/admin/Desktop/QFwR_2022.09/slides/L06_Im…
#> 2 3 3 an integer -- C:/Users/admin/Desktop/QFwR_2022.09/slides/L06_Im…
#> 3 3 4 1/0/T/F/TRUE/FALSE -- C:/Users/admin/Desktop/QFwR_2022.09/slides/L06_Im…
#> # … with 14 more rows
# Read everything in as character, then convert the types
read_csv(
  "data/ex_flights.csv",
  col_types = cols(
    .default = col_character()
  )
) %>%
  type_convert(col_types = "iiilcT")
#> # A tibble: 20 × 6
#> month day dep_time dep_delay tailnum time_hour
#> <int> <int> <int> <lgl> <chr> <dttm>
#> 1 6 26 932 TRUE -- 2013-06-26 12:00:00
#> 2 NA 5 NA NA -- 2013-12-05 17:00:00
#> 3 7 20 656 FALSE N17245 2013-07-20 10:00:00
#> # … with 17 more rows
10 / 32

>> Reading text files: readr

Reading many files of the same structure at once

# Generate example files with identical structure
nycflights13::flights %>% split(.$carrier) %>%
  purrr::iwalk(~ write_tsv(.x, glue::glue("data/flights_{.y}.tsv")))
# Vector of file paths
(files <- dir(path = "data/", pattern = "\\.tsv$", full.names = TRUE))
#> [1] "data/flights_9E.tsv" "data/flights_AA.tsv" "data/flights_AS.tsv"
#> [4] "data/flights_B6.tsv" "data/flights_DL.tsv" "data/flights_EV.tsv"
#> [7] "data/flights_F9.tsv" "data/flights_FL.tsv" "data/flights_HA.tsv"
#> [10] "data/flights_MQ.tsv" "data/flights_OO.tsv" "data/flights_UA.tsv"
#> [13] "data/flights_US.tsv" "data/flights_VX.tsv" "data/flights_WN.tsv"
#> [16] "data/flights_YV.tsv"
# Read them all at once (and row-bind into a single tibble)
read_tsv(files, id = "fpath") # the id argument stores each file's path in the fpath column
#> # A tibble: 336,776 × 20
#> fpath year month day dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶ carrier flight
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 data/f… 2013 1 1 810 810 0 1048 1037 11 9E 3538
#> 2 data/f… 2013 1 1 1451 1500 -9 1634 1636 -2 9E 4105
#> 3 data/f… 2013 1 1 1452 1455 -3 1637 1639 -2 9E 3295
#> # … with 336,773 more rows, 8 more variables: tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and
#> # abbreviated variable names ¹​dep_time, ²​sched_dep_time, ³​dep_delay, ⁴​arr_time,
#> # ⁵​sched_arr_time, ⁶​arr_delay
11 / 32

>> Reading text files: readr

readr's graphical user interface *

* In RStudio: Environment pane (top right) > Import Dataset > From Text (readr) ...

12 / 32

>> Reading other types of data

  • readr::read_rds(): R data files (.rds)
  • {{DBI}} with specific interface packages: databases, such as SQLite, MySQL and PostgreSQL

  • jsonlite: JSON files

  • {{xml2}}: XML and HTML files

  • {{rvest}}: HTML (Web Scraping)

  • {{httr2}}: Web APIs


  • {{Others}}: text, network, spatial, genome, image data, and more
13 / 32
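As a taste of the jsonlite entry above, a minimal sketch with made-up JSON (the tickers and prices are illustrative only):

```r
library(jsonlite)

json <- '[{"ticker":"AAPL","price":150.1},
          {"ticker":"MSFT","price":310.2}]'
fromJSON(json)  # simplifies an array of objects into a data.frame by default
str(fromJSON(json, simplifyDataFrame = FALSE))  # or keep the nested list structure
```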

2. Data Tidying

14 / 32

>> Tidy data

The three conditions of tidy data 📐

  • Every column is a variable.

  • Every row is an observation.

  • Every cell is a single value.

15 / 32

>> Tidy data

Why tidy data pays off 💪

☑ Tidy data stores data in a logically consistent way, which makes the tools that process it easier to learn and master

☑ "Every column is a variable" and "every row is an observation" let R's vectorized operations shine

☑ The tidyverse packages (dplyr and friends) are designed to take tidy data as input

16 / 32

>> Tidy data

But... not every dataset is tidy 😭

table1
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> # … with 3 more rows
table2
#> # A tibble: 12 × 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> # … with 9 more rows
table3
#> # A tibble: 6 × 3
#> country year rate
#> * <chr> <int> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> # … with 3 more rows
table4a # cases
#> # A tibble: 3 × 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
table4b # population
#> # A tibble: 3 × 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 19987071 20595360
#> 2 Brazil 172006362 174504898
#> 3 China 1272915272 1280428583
17 / 32

>> Enter the tidyr package 🎉

tidyr's functions fall roughly into five groups:

  1. Pivoting between long and wide formats: pivot_longer() / pivot_wider()

  2. Splitting and combining character columns: separate() / separate_rows() / extract() / unite()

  3. Rectangling deeply nested lists into tibbles: unnest_longer() / unnest_wider() / unnest_auto() / hoist()

  4. Nesting grouped data frames into one row per group, and back: nest() / unnest()

  5. Handling missing values:

  • complete(data, ..., fill = list()): Complete a data frame with missing combinations of data

  • expand(data, ...), crossing(...), nesting(...): Expand data frame to include all combinations of values

  • expand_grid(...): Create a tibble from all combinations of the inputted name-value pairs

  • full_seq(x, period, tol = 1e-06): Create the full sequence of values in a vector

  • drop_na(data, ...): Drop rows containing missing values

  • fill(data, ..., .direction = c("down", "up", "downup", "updown")): Fill in missing values with previous or next value

  • replace_na(data, replace, ...): Replace missing values with specified values

18 / 32
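A minimal sketch (toy data, not from the slides) of the missing-value helpers in group 5:

```r
library(tidyverse)

df <- tibble(year  = c(2020, 2020, 2021),
             qtr   = c(1, 2, 1),
             sales = c(10, NA, 12))

df %>% complete(year, qtr)               # adds the missing (2021, 2) combination
df %>% fill(sales, .direction = "down")  # carry the last observed value forward
df %>% replace_na(list(sales = 0))       # replace NA with a fixed value
df %>% drop_na(sales)                    # or drop the incomplete rows
```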

>> pivot_longer() / pivot_wider()

pivot_longer(
  data, cols,
  names_to = "name",
  names_prefix = NULL,
  names_sep = NULL,
  names_pattern = NULL,
  names_ptypes = NULL,
  names_transform = NULL,
  names_repair = "check_unique",
  values_to = "value",
  values_drop_na = FALSE,
  values_ptypes = NULL,
  values_transform = NULL, ...
)
table4a %>%
  pivot_longer(
    cols = c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases"
  )
# table4b %>% ...
#> # A tibble: 6 × 3
#> country year cases
#> <chr> <chr> <int>
#> 1 Afghanistan 1999 745
#> 2 Afghanistan 2000 2666
#> 3 Brazil 1999 37737
#> # … with 3 more rows
19 / 32

>> pivot_longer() / pivot_wider()

pivot_wider(
  data,
  id_cols = NULL, id_expand = FALSE,
  names_from = name,
  names_prefix = "", names_sep = "_",
  names_glue = NULL,
  names_sort = FALSE,
  names_vary = "fastest",
  names_expand = FALSE,
  names_repair = "check_unique",
  values_from = value,
  values_fill = NULL, values_fn = NULL,
  unused_fn = NULL, ...
)
table2 %>%
  pivot_wider(
    names_from = type,
    values_from = count
  )
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> # … with 3 more rows
20 / 32
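One pivot_wider() argument worth calling out is values_fn: when several input rows map to the same output cell, it aggregates them instead of producing list-columns. A toy sketch (data made up):

```r
library(tidyverse)

df <- tibble(id  = c(1, 1, 2),
             key = c("x", "x", "x"),
             val = c(1, 2, 5))

df %>%
  pivot_wider(names_from = key, values_from = val,
              values_fn = sum)  # id 1 gets x = 1 + 2 = 3; id 2 gets x = 5
```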

>> separate() / extract() / unite()

separate(
  data, col, into,
  sep = "[^[:alnum:]]+",
  remove = TRUE, convert = FALSE,
  extra = "warn", fill = "warn", ...
)
extract(
  data, col, into,
  regex = "([[:alnum:]]+)",
  remove = TRUE, convert = FALSE, ...
)
separate_rows(data, ..., sep, convert)
table3 %>%
  separate(
    rate,
    into = c("cases", "population"),
    convert = TRUE
  )
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> # … with 3 more rows
21 / 32
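separate_rows(), whose signature appears above, splits a multi-value cell across rows rather than into columns; a toy sketch (data made up):

```r
library(tidyverse)

df <- tibble(country = c("Afghanistan", "Brazil"),
             years   = c("1999,2000", "1999"))

df %>% separate_rows(years, sep = ",", convert = TRUE)
# three rows, with years parsed to integers by convert = TRUE
```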

>> separate() / extract() / unite()

unite(
  data,
  col,
  ...,
  sep = "_",
  remove = TRUE,
  na.rm = FALSE
)
table5 %>%
  unite(
    col = "year",
    century, year, sep = ""
  )
#> # A tibble: 6 × 3
#> country year rate
#> <chr> <chr> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> # … with 3 more rows
22 / 32

>> A complex example

who # ?who
#> # A tibble: 7,240 × 60
#> country iso2 iso3 year new_sp_…¹ new_s…² new_s…³ new_s…⁴ new_s…⁵ new_s…⁶ new_s…⁷
#> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 Afghanistan AF AFG 1980 NA NA NA NA NA NA NA
#> 2 Afghanistan AF AFG 1981 NA NA NA NA NA NA NA
#> 3 Afghanistan AF AFG 1982 NA NA NA NA NA NA NA
#> 4 Afghanistan AF AFG 1983 NA NA NA NA NA NA NA
#> 5 Afghanistan AF AFG 1984 NA NA NA NA NA NA NA
#> 6 Afghanistan AF AFG 1985 NA NA NA NA NA NA NA
#> 7 Afghanistan AF AFG 1986 NA NA NA NA NA NA NA
#> 8 Afghanistan AF AFG 1987 NA NA NA NA NA NA NA
#> 9 Afghanistan AF AFG 1988 NA NA NA NA NA NA NA
#> 10 Afghanistan AF AFG 1989 NA NA NA NA NA NA NA
#> # … with 7,230 more rows, 49 more variables: new_sp_f014 <int>, new_sp_f1524 <int>,
#> # new_sp_f2534 <int>, new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
#> # new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>,
#> # new_sn_m3544 <int>, new_sn_m4554 <int>, new_sn_m5564 <int>, new_sn_m65 <int>,
#> # new_sn_f014 <int>, new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
#> # new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>, new_ep_m014 <int>,
#> # new_ep_m1524 <int>, new_ep_m2534 <int>, new_ep_m3544 <int>, new_ep_m4554 <int>, …
23 / 32

>> A complex example

who1 <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = "name",
    values_to = "cases",
    values_drop_na = TRUE
  )
who1
#> # A tibble: 76,046 × 6
#> country iso2 iso3 year name cases
#> <chr> <chr> <chr> <int> <chr> <int>
#> 1 Afghanistan AF AFG 1997 new_sp_m014 0
#> 2 Afghanistan AF AFG 1997 new_sp_m1524 10
#> 3 Afghanistan AF AFG 1997 new_sp_m2534 6
#> # … with 76,043 more rows

What do the strings in the name variable encode? The who data dictionary tells us:

  1. The first three letters indicate whether these are "new" TB cases; in this dataset they are always new
  2. The next two or three letters give the type of TB (sp, sn, ep, rel, etc.)
  3. The next letter, m or f, gives the patient's sex (male or female)
  4. The remaining digits give the age group, e.g. 014 = 0-14 years | 1524 = 15-24 years
24 / 32

>> A complex example

who2 <- who1 %>%
  mutate(name = stringr::str_replace(name, "newrel", "new_rel")) %>%
  separate(name, c(NA, "type", "sexage"), sep = "_") # split on a delimiter
who2
#> # A tibble: 76,046 × 7
#> country iso2 iso3 year type sexage cases
#> <chr> <chr> <chr> <int> <chr> <chr> <int>
#> 1 Afghanistan AF AFG 1997 sp m014 0
#> 2 Afghanistan AF AFG 1997 sp m1524 10
#> 3 Afghanistan AF AFG 1997 sp m2534 6
#> # … with 76,043 more rows
who3 <- who2 %>% separate(sexage, c("sex", "age"), sep = 1) # split at a position
who3
#> # A tibble: 76,046 × 8
#> country iso2 iso3 year type sex age cases
#> <chr> <chr> <chr> <int> <chr> <chr> <chr> <int>
#> 1 Afghanistan AF AFG 1997 sp m 014 0
#> 2 Afghanistan AF AFG 1997 sp m 1524 10
#> 3 Afghanistan AF AFG 1997 sp m 2534 6
#> # … with 76,043 more rows
25 / 32

>> A complex example

# all in a pipe!
who %>%
  pivot_longer(cols = new_sp_m014:newrel_f65,
               names_to = "name", values_to = "cases", values_drop_na = TRUE) %>%
  mutate(name = stringr::str_replace(name, "newrel", "new_rel")) %>%
  separate(name, c(NA, "type", "sexage")) %>%
  separate(sexage, c("sex", "age"), sep = 1)
# all with powerful pivot_longer()!
who %>%
  pivot_longer(
    cols = starts_with("new"),
    names_to = c("type", "sex", "age"),
    names_pattern = "new_?(.*)_(.)(.*)",
    names_ptypes = list(
      sex = factor(levels = c("f", "m")),
      age = factor(
        levels = c("014", "1524", "2534",
                   "3544", "4554", "5564", "65"),
        ordered = TRUE)),
    values_to = "cases",
    values_drop_na = TRUE
  )
#> # A tibble: 76,046 × 8
#> country iso2 iso3 year type sex
#> <chr> <chr> <chr> <int> <chr> <fct>
#> 1 Afghani… AF AFG 1997 sp m
#> 2 Afghani… AF AFG 1997 sp m
#> 3 Afghani… AF AFG 1997 sp m
#> 4 Afghani… AF AFG 1997 sp m
#> 5 Afghani… AF AFG 1997 sp m
#> # … with 76,041 more rows, and 2 more
#> # variables: age <ord>, cases <int>


👉 vignette("pivot")

26 / 32

>> unnest_*() / hoist()

repurrrsive::gh_repos %>% View()

(repos <-
tibble(repo = repurrrsive::gh_repos))
#> # A tibble: 6 × 1
#> repo
#> <list>
#> 1 <list [30]>
#> 2 <list [30]>
#> 3 <list [30]>
#> 4 <list [26]>
#> 5 <list [30]>
#> # … with 1 more row
(repos <- repos %>% unnest_longer(repo))
#> # A tibble: 176 × 1
#> repo
#> <list>
#> 1 <named list [68]>
#> 2 <named list [68]>
#> 3 <named list [68]>
#> 4 <named list [68]>
#> 5 <named list [68]>
#> # … with 171 more rows
28 / 32

>> unnest_*() / hoist()

repos %>%
  unnest_wider(repo)
#> # A tibble: 176 × 68
#> id name full_…¹ owner
#> <int> <chr> <chr> <list>
#> 1 61160198 after gaborc… <named list>
#> 2 40500181 argufy gaborc… <named list>
#> 3 36442442 ask gaborc… <named list>
#> 4 34924886 baseimp… gaborc… <named list>
#> 5 61620661 citest gaborc… <named list>
#> # … with 171 more rows, 64 more
#> # variables: private <lgl>,
#> # html_url <chr>, description <chr>,
#> # fork <lgl>, url <chr>,
#> # forks_url <chr>, keys_url <chr>,
#> # collaborators_url <chr>,
#> # teams_url <chr>, hooks_url <chr>, …
repos %>% hoist(
  repo,
  login = c("owner", "login"),
  name = "name",
  homepage = "homepage",
  watchers = "watchers_count"
)
#> # A tibble: 176 × 5
#> login name homep…¹ watch…²
#> <chr> <chr> <chr> <int>
#> 1 gaborcsardi after <NA> 5
#> 2 gaborcsardi argufy <NA> 19
#> 3 gaborcsardi ask <NA> 5
#> 4 gaborcsardi baseimpor… <NA> 0
#> 5 gaborcsardi citest <NA> 0
#> # … with 171 more rows, 1 more
#> # variable: repo <list>, and
#> # abbreviated variable names
#> # ¹​homepage, ²​watchers


👉 vignette("rectangle")

29 / 32
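Group 4 of tidyr's functions (nest()/unnest()) was listed earlier but not demonstrated; a minimal sketch using the built-in mtcars data:

```r
library(tidyverse)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  nest()  # one row per cyl; the other columns move into the list-column `data`
by_cyl
by_cyl %>% unnest(data)  # back to one row per car
```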

Homework

30 / 32


1. Review 📖 chapters 10, 11, and 12 of R for Data Science

2. Download (or print) 📰 the {{Data Import cheatsheet}} and the {{Data Tidying cheatsheet}}, and read them

3. browseVignettes(package = c("readr", "tidyr")) 📝

Introduction to readr
Tidy data
Pivoting

4. Finish the exercises for Lecture 6 👩‍💻

remotes::install_github("qfwr2022/qfwr")
# remember to allow installing qfwr's dependencies, such as learnr and testwhat
library(qfwr)
qfwr_ex("L06")
31 / 32
