How could the channel be so clear? Because fresh water flows in from the source. —— Zhu Xi (Song dynasty)

Like families, tidy datasets are all alike but every messy dataset is messy in its own way. —— Hadley Wickham
1. Data Import: the readr package (v2.1.3)
2. Data Tidying: the tidyr package (v1.2.1)
   pivot_longer() and pivot_wider()
   separate() and unite()
   unnest_*() and hoist()
1. Data Import

The readr package
☑ Consistent function naming

read_csv() | read_csv2() | read_tsv() | read_delim() | read_table() | read_fwf() | read_lines() | read_lines_raw() | read_file() | read_file_raw() | read_log()
☑ Largely consistent arguments

read_*(
  file, # delim = NULL, escape_backslash = FALSE, escape_double = TRUE,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  id = NULL,
  locale = default_locale(),
  na = c("", "NA"),
  quoted_na = TRUE,
  quote = "\"",
  comment = "",
  trim_ws = TRUE,
  skip = 0,
  n_max = Inf,
  guess_max = min(1000, n_max),
  name_repair = "unique",
  num_threads = readr_threads(),
  progress = show_progress(),
  show_col_types = should_show_types(),
  skip_empty_rows = TRUE,
  lazy = should_read_lazy()
)
# file: Either a path to a file, a connection, or literal data.
# Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed.
# Files starting with http://, https://, ftp://, or ftps:// will be downloaded.
# Using a value of clipboard() will read from the system clipboard.
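As a quick illustration of that flexible file argument, here is a minimal sketch; readr_example() ships with readr and returns the path to one of its bundled sample files.

# A minimal sketch of the `file` argument's flexibility
library(readr)
read_csv(readr_example("mtcars.csv"))      # a path to a file on disk
read_csv(readr_example("mtcars.csv.zip"))  # a compressed file, uncompressed on the fly
read_csv("x,y\n1,2")                       # literal data passed as a string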
The readr package
☑ Compared with read.*() from the built-in utils package

Returns a tibble* (more precisely, a spec_tbl_df)
Does less by default: e.g. it does not convert character columns to factors (read.*() has a stringsAsFactors argument), does not change column names, and does not turn a data column into row names
Depends less on the system and on environment variables, so results are more reproducible
Usually reads much faster (~50x), with a progress bar for large files
?`tbl_df-class`
tibble: a subclass of data.frame and the central data structure for the tidyverse; lazy and surly, it does less and complains more than base data.frames.
1. Columns are never type-converted automatically (e.g. character -> factor), and list-columns are supported natively
2. Only length-1 ("scalar") vectors are recycled
3. Non-syntactic variable names are allowed, e.g. tibble(`:)` = "smile")
4. Row names are never added, and storing information in row names is discouraged
5. Subsetting a tibble with [ always returns a new tibble; [[ and $ always return a vector, and $ does not partially match variable names
6. Printing a tibble adapts to the screen size and shows more useful information (similar to str())
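A small sketch contrasting points 3 and 5 above with base data.frame behaviour (the toy column names are illustrative only):

library(tibble)
df <- data.frame(xy = c("a", "b"))
tb <- tibble(xy = c("a", "b"))
df$x                    # $ partially matches and silently returns column xy
tb$x                    # warns "Unknown or uninitialised column" and returns NULL
df[, "xy"]              # [ drops down to a character vector
tb[, "xy"]              # [ returns a 2 × 1 tibble
tibble(`:)` = "smile")  # non-syntactic names are allowed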
The readr package — common arguments, illustrated with read_csv()
library(tidyverse)
read_csv("The 1st line of metadata
  The 2nd line of metadata
  # A comment to skip
  x,y,z
  1,2,3", skip = 2, comment = "#")
#> Rows: 1 Columns: 3
#> ── Column specification ────────────────
#> Delimiter: ","
#> dbl (3): x, y, z
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
read_csv("1,2,3\n4,5,6")
#> # A tibble: 1 × 3
#>     `1`   `2`   `3`
#>   <dbl> <dbl> <dbl>
#> 1     4     5     6
read_csv("1,2,3\n4,5,6", col_names = FALSE)
#> # A tibble: 2 × 3
#>      X1    X2    X3
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
#> # A tibble: 2 × 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6
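For comparison, read_csv2() from the naming list above targets locales that use the semicolon as the field separator and the comma as the decimal mark; a minimal sketch:

# read_csv2(): semicolon-delimited fields, comma as decimal mark
read_csv2("x;y\n1,5;2,3") # parses to x = 1.5, y = 2.3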
The readr package
readr uses a heuristic strategy to parse the data in a text file
# Generate an example data file
set.seed(123456) # set the random seed
nycflights13::flights %>%
  mutate(dep_delay = if_else(dep_delay <= 0, FALSE, TRUE)) %>%
  select(month:dep_time, dep_delay, tailnum, time_hour) %>%
  slice_sample(n = 20) %>%
  mutate(across(everything(), ~ ifelse(runif(20) <= 0.1, NA, .))) %>%
  mutate(time_hour = lubridate::as_datetime(time_hour)) %>%
  print(n = 5) %>%
  write_csv(file = "data/ex_flights.csv", na = "--") # the default is na = "NA"
#> # A tibble: 20 × 6
#>   month   day dep_time dep_delay tailnum time_hour
#>   <int> <int>    <int> <lgl>     <chr>   <dttm>
#> 1     6    26      932 TRUE      <NA>    2013-06-26 12:00:00
#> 2    NA     5       NA NA        <NA>    2013-12-05 17:00:00
#> 3     7    20      656 FALSE     N17245  2013-07-20 10:00:00
#> 4     5    16       NA FALSE     N27152  2013-05-16 12:00:00
#> 5    12    23       NA NA        <NA>    2013-12-23 23:00:00
#> # … with 15 more rows
(ex_flights <- read_csv("data/ex_flights.csv")) # the default is na = c("", "NA")
#> Rows: 20 Columns: 6
#> ── Column specification ──────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): month, day, dep_time, dep_delay, tailnum, time_hour
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 20 × 6
#>   month day   dep_time dep_delay tailnum time_hour
#>   <chr> <chr> <chr>    <chr>     <chr>   <chr>
#> 1 6     26    932      TRUE      --      2013-06-26T12:00:00Z
#> 2 --    5     --       --        --      2013-12-05T17:00:00Z
#> 3 7     20    656      FALSE     N17245  2013-07-20T10:00:00Z
#> 4 5     16    --       FALSE     N27152  2013-05-16T12:00:00Z
#> 5 12    23    --       --        --      2013-12-23T23:00:00Z
#> # … with 15 more rows
1. Split the text file into a matrix of strings, using the given or guessed delimiter
2. Determine the type of each column of strings
   2.1 specified directly via the col_types argument
   2.2 guessed: read the first guess_max rows of the file (v2 vs. v1) and try each column against the types in this order: logical -> integer -> double -> number -> time -> date -> date-time -> character
3. Parse each column of strings into a vector of the corresponding type
When readr's automatic parsing runs into trouble:
use problems() to inspect the issues encountered while reading
set the relevant read_*() arguments as needed, e.g. na = "--"
set the read_*() argument col_types = cols(...) directly
read everything in as character columns, then convert the types with type_convert()
or handle it in a later tidying step (e.g. dplyr::mutate() + parse_*(), sketched below)
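A minimal sketch of that last strategy, assuming the same data/ex_flights.csv file as above; each parse_*() function accepts its own na argument:

# mutate() + parse_*(): read everything as character, then fix columns one by one
read_csv("data/ex_flights.csv",
         col_types = cols(.default = col_character())) %>%
  mutate(
    month     = parse_integer(month, na = "--"),     # character -> integer
    dep_delay = parse_logical(dep_delay, na = "--"), # character -> logical
    time_hour = parse_datetime(time_hour)            # ISO 8601 -> date-time
  )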
# Set the missing-value argument
read_csv("data/ex_flights.csv", na = "--") %>%
  spec()
#> cols(
#>   month = col_double(),
#>   day = col_double(),
#>   dep_time = col_double(),
#>   dep_delay = col_logical(),
#>   tailnum = col_character(),
#>   time_hour = col_datetime(format = "")
#> )

# Set the col_types argument directly; see ?cols
ex_flights <- read_csv(
  "data/ex_flights.csv",
  col_types = cols( # or cols_only()
    dep_delay = col_logical(),  # "l"
    tailnum = col_character(),  # "c"
    time_hour = col_datetime(), # "T"
    .default = col_integer()
  )
  # equivalently: col_types = "iiilcT"
)
#> Warning: One or more parsing issues, call `problems()` on
#> your data frame for details, e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
problems(ex_flights)
#> # A tibble: 17 × 5
#>     row   col expected           actual file
#>   <int> <int> <chr>              <chr>  <chr>
#> 1     3     1 an integer         --     C:/Users/admin/Desktop/QFwR_2022.09/slides/L06_Im…
#> 2     3     3 an integer         --     C:/Users/admin/Desktop/QFwR_2022.09/slides/L06_Im…
#> 3     3     4 1/0/T/F/TRUE/FALSE --     C:/Users/admin/Desktop/QFwR_2022.09/slides/L06_Im…
#> # … with 14 more rows
# Read everything in, then convert the types
read_csv(
  "data/ex_flights.csv",
  col_types = cols(
    .default = col_character()
  )
) %>%
  type_convert(col_types = "iiilcT")
#> # A tibble: 20 × 6
#>   month   day dep_time dep_delay tailnum time_hour
#>   <int> <int>    <int> <lgl>     <chr>   <dttm>
#> 1     6    26      932 TRUE      --      2013-06-26 12:00:00
#> 2    NA     5       NA NA        --      2013-12-05 17:00:00
#> 3     7    20      656 FALSE     N17245  2013-07-20 10:00:00
#> # … with 17 more rows
The readr package
Read multiple files of the same structure in one call

# Generate example files with identical structure
nycflights13::flights %>%
  split(.$carrier) %>%
  purrr::iwalk(~ write_tsv(.x, glue::glue("data/flights_{.y}.tsv")))
# A vector of file paths
(files <- dir(path = "data/", pattern = "\\.tsv$", full.names = TRUE))
#>  [1] "data/flights_9E.tsv" "data/flights_AA.tsv" "data/flights_AS.tsv"
#>  [4] "data/flights_B6.tsv" "data/flights_DL.tsv" "data/flights_EV.tsv"
#>  [7] "data/flights_F9.tsv" "data/flights_FL.tsv" "data/flights_HA.tsv"
#> [10] "data/flights_MQ.tsv" "data/flights_OO.tsv" "data/flights_UA.tsv"
#> [13] "data/flights_US.tsv" "data/flights_VX.tsv" "data/flights_WN.tsv"
#> [16] "data/flights_YV.tsv"

# Read them all at once (and stack them into one tibble)
read_tsv(files, id = "fpath") # the id argument stores each file's path in the variable fpath
#> # A tibble: 336,776 × 20
#>   fpath    year month   day dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶ carrier flight
#>   <chr>   <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>    <dbl>
#> 1 data/f…  2013     1     1     810     810       0    1048    1037      11 9E        3538
#> 2 data/f…  2013     1     1    1451    1500      -9    1634    1636      -2 9E        4105
#> 3 data/f…  2013     1     1    1452    1455      -3    1637    1639      -2 9E        3295
#> # … with 336,773 more rows, 8 more variables: tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and
#> #   abbreviated variable names ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time,
#> #   ⁵sched_arr_time, ⁶arr_delay
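For what it's worth, a roughly equivalent purrr sketch (the usual route before readr 2.0 accepted a vector of paths):

files %>%
  purrr::set_names() %>%                  # name each path by itself
  purrr::map_dfr(read_tsv, .id = "fpath") # .id stores those names in fpath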
The readr package
A graphical user interface for readr *

* In RStudio: Environment pane (top right) > Import Dataset > From Text (readr)...
readr::read_rds(): R data files (.rds)
{{vroom}}: Read and Write Rectangular Text Data Quickly
{{readxl}}: excel files (.xls and .xlsx) <- take a 🧐 at it!
{{haven}}: SPSS, Stata, and SAS files
{{arrow}}: Apache Arrow
2. Data Tidying (Tidy Data)

Three conditions for tidy data 📐
Every column is a variable.
Every row is an observation.
Every cell is a single value.
Benefits of tidy data 💪
☑ Tidy data stores data in one logically consistent way, which makes it easier to learn and master the tools that work with it
☑ "Every column is a variable" and "every row is an observation" let you exploit R's vectorized operations
☑ The tidyverse packages (such as dplyr) are designed to take tidy data as input

But... not every dataset is tidy 😭
table1
#> # A tibble: 6 × 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> # … with 3 more rows
table2
#> # A tibble: 12 × 4
#>   country      year type          count
#>   <chr>       <int> <chr>         <int>
#> 1 Afghanistan  1999 cases           745
#> 2 Afghanistan  1999 population 19987071
#> 3 Afghanistan  2000 cases          2666
#> # … with 9 more rows
table3
#> # A tibble: 6 × 3
#>   country      year rate
#> * <chr>       <int> <chr>
#> 1 Afghanistan  1999 745/19987071
#> 2 Afghanistan  2000 2666/20595360
#> 3 Brazil       1999 37737/172006362
#> # … with 3 more rows
table4a # cases
#> # A tibble: 3 × 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766
table4b # population
#> # A tibble: 3 × 3
#>   country         `1999`     `2000`
#> * <chr>            <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583
Enter the tidyr package 🎉
The functions in tidyr fall roughly into five groups:

1. Pivoting between long and wide form: pivot_longer() and pivot_wider()
2. Splitting or combining character columns: separate(), separate_rows(), extract() and unite()
3. Rectangling deeply nested lists into tibbles: unnest_longer(), unnest_wider(), unnest_auto() and hoist()
4. Converting a grouped data frame into a nested one (one row per group) and back: nest() and unnest()
5. Handling missing values: ->
complete(data, ..., fill = list())
: Complete a data frame with missing combinations of data
expand(data, ...)
, crossing(...)
, nesting(...)
: Expand data frame to include all combinations of values
expand_grid(...)
: Create a tibble from all combinations of the inputted name-value pairs
full_seq(x, period, tol = 1e-06)
: Create the full sequence of values in a vector
drop_na(data, ...)
: Drop rows containing missing values
fill(data, ..., .direction = c("down", "up", "downup", "updown"))
: Fill in missing values with previous or next value
replace_na(data, replace, ...)
: Replace missing values with specified values
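A small sketch of three of these helpers on made-up data (the tibble below is illustrative only, assuming the tidyverse is loaded):

df <- tibble(
  year  = c(2020, 2020, 2021),
  half  = c(1, 2, 2),
  sales = c(10, NA, 30)
)
df %>% complete(year, half)             # adds the missing year = 2021, half = 1 row
df %>% fill(sales, .direction = "down") # carries 10 forward into the NA
df %>% replace_na(list(sales = 0))      # replaces the NA in sales with 0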
pivot_longer() and pivot_wider()

pivot_longer(
  data,
  cols,
  names_to = "name",
  names_prefix = NULL,
  names_sep = NULL,
  names_pattern = NULL,
  names_ptypes = NULL,
  names_transform = NULL,
  names_repair = "check_unique",
  values_to = "value",
  values_drop_na = FALSE,
  values_ptypes = NULL,
  values_transform = NULL,
  ...
)
table4a %>%
  pivot_longer(
    cols = c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases"
  )
# table4b %>% ...
#> # A tibble: 6 × 3
#>   country     year  cases
#>   <chr>       <chr> <int>
#> 1 Afghanistan 1999    745
#> 2 Afghanistan 2000   2666
#> 3 Brazil      1999  37737
#> # … with 3 more rows
pivot_wider(
  data,
  id_cols = NULL,
  id_expand = FALSE,
  names_from = name,
  names_prefix = "",
  names_sep = "_",
  names_glue = NULL,
  names_sort = FALSE,
  names_vary = "fastest",
  names_expand = FALSE,
  names_repair = "check_unique",
  values_from = value,
  values_fill = NULL,
  values_fn = NULL,
  unused_fn = NULL,
  ...
)
table2 %>%
  pivot_wider(
    names_from = type,
    values_from = count
  )
#> # A tibble: 6 × 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> # … with 3 more rows
separate(), extract() and unite()

separate(
  data,
  col,
  into,
  sep = "[^[:alnum:]]+",
  remove = TRUE,
  convert = FALSE,
  extra = "warn",
  fill = "warn",
  ...
)
extract(
  data,
  col,
  into,
  regex = "([[:alnum:]]+)",
  remove = TRUE,
  convert = FALSE,
  ...
)
separate_rows(data, ..., sep, convert)
table3 %>%
  separate(
    rate,
    into = c("cases", "population"),
    convert = TRUE
  )
#> # A tibble: 6 × 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> # … with 3 more rows
unite(
  data,
  col,
  ...,
  sep = "_",
  remove = TRUE,
  na.rm = FALSE
)
table5 %>%
  unite(
    col = "year", century, year,
    sep = ""
  )
#> # A tibble: 6 × 3
#>   country     year  rate
#>   <chr>       <chr> <chr>
#> 1 Afghanistan 1999  745/19987071
#> 2 Afghanistan 2000  2666/20595360
#> 3 Brazil      1999  37737/172006362
#> # … with 3 more rows
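separate_rows() shows up in the signatures above but not in the examples; a minimal sketch on made-up data:

# separate_rows() splits a delimited cell across rows
tibble(x = 1:2, y = c("a,b", "c")) %>%
  separate_rows(y, sep = ",") # three rows: (1, "a"), (1, "b"), (2, "c")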
who # ?who
#> # A tibble: 7,240 × 60
#>    country     iso2  iso3   year new_sp_…¹ new_s…² new_s…³ new_s…⁴ new_s…⁵ new_s…⁶ new_s…⁷
#>    <chr>       <chr> <chr> <int>     <int>   <int>   <int>   <int>   <int>   <int>   <int>
#>  1 Afghanistan AF    AFG    1980        NA      NA      NA      NA      NA      NA      NA
#>  2 Afghanistan AF    AFG    1981        NA      NA      NA      NA      NA      NA      NA
#>  3 Afghanistan AF    AFG    1982        NA      NA      NA      NA      NA      NA      NA
#>  4 Afghanistan AF    AFG    1983        NA      NA      NA      NA      NA      NA      NA
#>  5 Afghanistan AF    AFG    1984        NA      NA      NA      NA      NA      NA      NA
#>  6 Afghanistan AF    AFG    1985        NA      NA      NA      NA      NA      NA      NA
#>  7 Afghanistan AF    AFG    1986        NA      NA      NA      NA      NA      NA      NA
#>  8 Afghanistan AF    AFG    1987        NA      NA      NA      NA      NA      NA      NA
#>  9 Afghanistan AF    AFG    1988        NA      NA      NA      NA      NA      NA      NA
#> 10 Afghanistan AF    AFG    1989        NA      NA      NA      NA      NA      NA      NA
#> # … with 7,230 more rows, 49 more variables: new_sp_f014 <int>, new_sp_f1524 <int>,
#> #   new_sp_f2534 <int>, new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
#> #   new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>,
#> #   new_sn_m3544 <int>, new_sn_m4554 <int>, new_sn_m5564 <int>, new_sn_m65 <int>,
#> #   new_sn_f014 <int>, new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
#> #   new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>, new_ep_m014 <int>,
#> #   new_ep_m1524 <int>, new_ep_m2534 <int>, new_ep_m3544 <int>, new_ep_m4554 <int>, …
who1 <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = "name",
    values_to = "cases",
    values_drop_na = TRUE
  )
who1
#> # A tibble: 76,046 × 6
#>   country     iso2  iso3   year name         cases
#>   <chr>       <chr> <chr> <int> <chr>        <int>
#> 1 Afghanistan AF    AFG    1997 new_sp_m014      0
#> 2 Afghanistan AF    AFG    1997 new_sp_m1524    10
#> 3 Afghanistan AF    AFG    1997 new_sp_m2534     6
#> # … with 76,043 more rows
What do the strings in the variable name stand for? The data dictionary for who tells us:

- the first 3 letters say whether the column counts "new" TB cases; in this data they are all new
- the next 2-3 letters give the type of TB (sp, sn, ep, rel, ...)
- the next letter, m or f, gives the patients' sex (male or female)
- the remaining digits give the age group, e.g. 014 = 0-14 years | 1524 = 15-24 years
who2 <- who1 %>%
  mutate(name = stringr::str_replace(name, "newrel", "new_rel")) %>%
  separate(name, c(NA, "type", "sexage"), sep = "_") # split on the separator
who2
#> # A tibble: 76,046 × 7
#>   country     iso2  iso3   year type  sexage cases
#>   <chr>       <chr> <chr> <int> <chr> <chr>  <int>
#> 1 Afghanistan AF    AFG    1997 sp    m014       0
#> 2 Afghanistan AF    AFG    1997 sp    m1524     10
#> 3 Afghanistan AF    AFG    1997 sp    m2534      6
#> # … with 76,043 more rows
who3 <- who2 %>%
  separate(sexage, c("sex", "age"), sep = 1) # split by position
who3
#> # A tibble: 76,046 × 8
#>   country     iso2  iso3   year type  sex   age   cases
#>   <chr>       <chr> <chr> <int> <chr> <chr> <chr> <int>
#> 1 Afghanistan AF    AFG    1997 sp    m     014       0
#> 2 Afghanistan AF    AFG    1997 sp    m     1524     10
#> 3 Afghanistan AF    AFG    1997 sp    m     2534      6
#> # … with 76,043 more rows
# all in a pipe!
who %>%
  pivot_longer(cols = new_sp_m014:newrel_f65,
               names_to = "name", values_to = "cases",
               values_drop_na = TRUE) %>%
  mutate(name = stringr::str_replace(name, "newrel", "new_rel")) %>%
  separate(name, c(NA, "type", "sexage")) %>%
  separate(sexage, c("sex", "age"), sep = 1)
# all with the powerful pivot_longer()!
who %>%
  pivot_longer(
    cols = starts_with("new"),
    names_to = c("type", "sex", "age"),
    names_pattern = "new_?(.*)_(.)(.*)",
    names_ptypes = list(
      sex = factor(levels = c("f", "m")),
      age = factor(
        levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
        ordered = TRUE)),
    values_to = "cases",
    values_drop_na = TRUE
  )
#> # A tibble: 76,046 × 8
#>   country  iso2  iso3   year type  sex
#>   <chr>    <chr> <chr> <int> <chr> <fct>
#> 1 Afghani… AF    AFG    1997 sp    m
#> 2 Afghani… AF    AFG    1997 sp    m
#> 3 Afghani… AF    AFG    1997 sp    m
#> 4 Afghani… AF    AFG    1997 sp    m
#> 5 Afghani… AF    AFG    1997 sp    m
#> # … with 76,041 more rows, and 2 more
#> #   variables: age <ord>, cases <int>
👉 vignette("pivot")
Exercise: fill in the blanks to convert the billboard dataset into the target structure:
library(tidyverse)
billboard %>% View()
billboard %>%
pivot_ (
cols = ,
names_to = ,
,
names_transform = ,
values_to = ,
values_drop_na = TRUE
)
# A tibble: 317 × 79
  artist    track date.ent…¹   wk1   wk2
  <chr>     <chr> <date>     <dbl> <dbl>
1 2 Pac     Baby… 2000-02-26    87    82
2 2Ge+her   The … 2000-09-02    91    87
3 3 Doors … Kryp… 2000-04-08    81    70
# … with 314 more rows, 74 more
#   variables: wk3 <dbl>, wk4 <dbl>,
#   wk5 <dbl>, wk6 <dbl>, wk7 <dbl>,
#   wk8 <dbl>, wk9 <dbl>, wk10 <dbl>,
#   wk11 <dbl>, wk12 <dbl>, wk13 <dbl>,
#   wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
#   wk17 <dbl>, wk18 <dbl>, …

# A tibble: 5,307 × 5
  artist track    date.ent…¹  week  rank
  <chr>  <chr>    <date>     <int> <dbl>
1 2 Pac  Baby Do… 2000-02-26     1    87
2 2 Pac  Baby Do… 2000-02-26     2    82
3 2 Pac  Baby Do… 2000-02-26     3    72
# … with 5,304 more rows, and
#   abbreviated variable name
#   ¹date.entered
library(tidyverse)
billboard %>% View()
billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    names_transform = list(week = as.integer),
    values_to = "rank",
    values_drop_na = TRUE
  )
unnest_*() and hoist()

repurrrsive::gh_repos %>% View()
(repos <- tibble(repo = repurrrsive::gh_repos))
#> # A tibble: 6 × 1
#>   repo
#>   <list>
#> 1 <list [30]>
#> 2 <list [30]>
#> 3 <list [30]>
#> 4 <list [26]>
#> 5 <list [30]>
#> # … with 1 more row
(repos <- repos %>% unnest_longer(repo))
#> # A tibble: 176 × 1
#>   repo
#>   <list>
#> 1 <named list [68]>
#> 2 <named list [68]>
#> 3 <named list [68]>
#> 4 <named list [68]>
#> 5 <named list [68]>
#> # … with 171 more rows
repos %>% unnest_wider(repo)
#> # A tibble: 176 × 68
#>         id name     full_…¹ owner
#>      <int> <chr>    <chr>   <list>
#> 1 61160198 after    gaborc… <named list>
#> 2 40500181 argufy   gaborc… <named list>
#> 3 36442442 ask      gaborc… <named list>
#> 4 34924886 baseimp… gaborc… <named list>
#> 5 61620661 citest   gaborc… <named list>
#> # … with 171 more rows, 64 more
#> #   variables: private <lgl>,
#> #   html_url <chr>, description <chr>,
#> #   fork <lgl>, url <chr>,
#> #   forks_url <chr>, keys_url <chr>,
#> #   collaborators_url <chr>,
#> #   teams_url <chr>, hooks_url <chr>, …
repos %>%
  hoist(
    repo,
    login = c("owner", "login"),
    name = "name",
    homepage = "homepage",
    watchers = "watchers_count"
  )
#> # A tibble: 176 × 5
#>   login       name       homep…¹ watch…²
#>   <chr>       <chr>      <chr>     <int>
#> 1 gaborcsardi after      <NA>          5
#> 2 gaborcsardi argufy     <NA>         19
#> 3 gaborcsardi ask        <NA>          5
#> 4 gaborcsardi baseimpor… <NA>          0
#> 5 gaborcsardi citest     <NA>          0
#> # … with 171 more rows, 1 more
#> #   variable: repo <list>, and
#> #   abbreviated variable names
#> #   ¹homepage, ²watchers
👉 vignette("rectangle")
1. Review 📖 chapters 10, 11 and 12 of R for Data Science
2. Download (and print) 📰 the {{Data Import cheatsheet}} and the {{Data Tidying cheatsheet}}, and read them
3. browseVignettes(package = c("readr", "tidyr")) 📝 Introduction to readr | Tidy data | Pivoting
4. Finish the exercises for Lecture 6 👩‍💻

remotes::install_github("qfwr2022/qfwr")
# remember to allow installation of the packages qfwr depends on (learnr, testwhat, etc.)
library(qfwr)
qfwr_ex("L06")
How could the channel be so clear? Because fresh water flows in from the source. —— Zhu Xi (Song dynasty)