
Multiple gathering in R to create tidy dataset

I have a complicated untidy dataset; a dummy version of it can be replicated with the code below.

studentID <- seq_len(250)
score2018 <- runif(250)
score2019 <- runif(250)
score2020 <- runif(250)
payment2018 <- runif(250, min=10000, max=12000)
payment2019 <- runif(250, min=11000, max=13000)
payment2020 <- runif(250, min=12000, max=14000)
attendance2018 <- runif(250, min=0.75, max=1)
attendance2019 <- runif(250, min=0.75, max=1)
attendance2020 <- runif(250, min=0.75, max=1)

untidy_df <- data.frame(studentID, score2018, score2019, score2020, payment2018, payment2019, payment2020, attendance2018, attendance2019, attendance2020)

I would like to gather this data frame so that it has only 5 columns: studentID, year, score, payment, attendance. I know how to gather at a basic level, but I have 3 sets of columns to gather here, and I can't see how to do this in one go.

Thanks in advance!

With tidyr you can use pivot_longer:

library(tidyr)

untidy_df %>%
  pivot_longer(cols = -studentID, names_to = c(".value", "year"), names_pattern = "(\\w+)(\\d{4})")

Output

# A tibble: 750 x 5
   studentID year    score payment attendance
       <int> <chr>   <dbl>   <dbl>      <dbl>
 1         1 2018  0.432    10762.      0.786
 2         1 2019  0.948    11340.      0.909
 3         1 2020  0.122    12837.      0.944
 4         2 2018  0.422    11515.      0.950
 5         2 2019  0.0639   12968.      0.828
 6         2 2020  0.611    13645.      0.901
 7         3 2018  0.489    11281.      0.784
 8         3 2019  0.00337  12250.      0.753
 9         3 2020  0.711    12898.      0.803
10         4 2018  0.0596   10526.      0.842
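
In the output above, year is a character column because it is parsed out of the column names. As a minimal sketch of one way to handle that, the example below uses the same pivot_longer call on a small made-up data frame (small_df is hypothetical, not part of the question) and adds names_transform, which assumes tidyr 1.1.0 or later.

```r
library(tidyr)

# Hypothetical 3-student data frame in the same wide layout as untidy_df
set.seed(1)
small_df <- data.frame(
  studentID   = 1:3,
  score2018   = runif(3),
  score2019   = runif(3),
  payment2018 = runif(3, min = 10000, max = 12000),
  payment2019 = runif(3, min = 11000, max = 13000)
)

tidy_small <- pivot_longer(
  small_df,
  cols = -studentID,
  # ".value" turns the first capture group into output column names;
  # the second capture group fills the "year" column
  names_to = c(".value", "year"),
  names_pattern = "(\\w+)(\\d{4})",
  # convert the parsed year from character to integer
  names_transform = list(year = as.integer)
)

tidy_small
```

The result should have one row per student and year, with score and payment as separate columns and year stored as an integer.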

Using base R:

tidy_df <- reshape(untidy_df, direction="long", idvar="studentID", varying=2:10, sep="")
head(tidy_df)

       studentID time      score  payment attendance
1.2018         1 2018 0.86743970 10995.45  0.9473540
2.2018         2 2018 0.53204701 11152.74  0.8167776
3.2018         3 2018 0.90072918 10631.06  0.9335316
4.2018         4 2018 0.89154492 11889.23  0.9098399
5.2018         5 2018 0.06320442 10973.20  0.8118909
6.2018         6 2018 0.67519166 11751.67  0.8328860

If you want "year" instead of the default "time", add timevar="year".
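
Putting the timevar suggestion together with the reshape call gives something like the sketch below. It uses a small made-up data frame (small_df is hypothetical), and the final sorting and row-name cleanup are optional extras that are not part of the answer above.

```r
# Hypothetical 3-student data frame in the same wide layout as untidy_df
set.seed(1)
small_df <- data.frame(
  studentID   = 1:3,
  score2018   = runif(3),
  score2019   = runif(3),
  payment2018 = runif(3, min = 10000, max = 12000),
  payment2019 = runif(3, min = 11000, max = 13000)
)

tidy_small <- reshape(
  small_df,
  direction = "long",
  idvar     = "studentID",
  varying   = 2:5,     # every column except studentID
  sep       = "",      # split names where a letter is followed by a digit
  timevar   = "year"   # name the time column "year" instead of "time"
)

# Optional: sort by student then year, and drop the "1.2018"-style row names
tidy_small <- tidy_small[order(tidy_small$studentID, tidy_small$year), ]
rownames(tidy_small) <- NULL

tidy_small
```

Here reshape stores year as a number rather than a string, so no further conversion should be needed.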

We could try:

library(dplyr)
library(tidyr)

untidy_df %>% 
  pivot_longer(cols = -studentID) %>% 
  separate(col = name, sep = "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)", into = c("measure", "year")) %>% 
  pivot_wider(names_from = measure, values_from = value )

Which returns:

   studentID year  score payment attendance
       <int> <chr> <dbl>   <dbl>      <dbl>
 1         1 2018  0.807  10179.      0.974
 2         1 2019  0.599  11601.      0.785
 3         1 2020  0.515  12347.      0.760
 4         2 2018  0.474  11154.      0.983
 5         2 2019  0.409  11682.      0.864
 6         2 2020  0.688  13756.      0.812
 7         3 2018  0.509  11746.      0.870
 8         3 2019  0.867  12851.      0.801
 9         3 2020  0.878  12710.      0.955
10         4 2018  0.621  11165.      0.975
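
The sep regex in this answer splits each column name at the boundary between its non-digit part and its digit part. To illustrate what ends up in measure and year, here is a rough base R equivalent using sub(). It is only a sketch of the split, not what separate does internally, and nms is a made-up vector of example names.

```r
# Example column names in the same "<measure><year>" layout as untidy_df
nms <- c("score2018", "payment2019", "attendance2020")

# Drop the trailing digits to get the measure name
measure <- sub("\\d+$", "", nms)

# Drop the leading non-digits to get the year (still a character string)
year <- sub("^\\D+", "", nms)

data.frame(name = nms, measure = measure, year = year)
```

As in the tibble above, year comes out as character here; wrap it in as.integer() if a numeric year is needed.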
