在 R 中多次收集以创建整洁的数据集

Question

I have a complicated untidy dataset which a dummy version of can be replicated below.我有一个复杂的不整洁的数据集，可以在下面复制一个虚拟版本。

studentID <- seq(1:250)
score2018 <- runif(250)
score2019 <- runif(250)
score2020 <- runif(250)
payment2018 <- runif(250, min=10000, max=12000)
payment2019 <- runif(250, min=11000, max=13000)
payment2020 <- runif(250, min=12000, max=14000)
attendance2018 <- runif(250, min=0.75, max=1)
attendance2019 <- runif(250, min=0.75, max=1)
attendance2020 <- runif(250, min=0.75, max=1)

untidy_df <- data.frame(studentID, score2018, score2019, score2020, payment2018, payment2019, payment2020, attendance2018, attendance2019, attendance2020)

I would like to gather this data frame so that we only have 5 columns: studentID, year, score, payment, attendance.我想收集这个数据框，以便我们只有 5 列：studentID、year、score、payment、attendance。 I know how to gather at a basic level, but I have 3 sets to gather here, and I can't see how to do this in one go.我知道如何在基本级别收集，但我有 3 组要在这里收集，但我看不到如何在一个 go 中进行收集。

Thanks in advance!提前致谢！

Answer 1

With tidyr you can use pivot_longer :使用tidyr ，您可以使用pivot_longer ：

library(tidyr)

untidy_df %>%
  pivot_longer(cols = -studentID, names_to = c(".value", "year"), names_pattern = "(\\w+)(\\d{4})")

Output Output

# A tibble: 750 x 5
   studentID year    score payment attendance
       <int> <chr>   <dbl>   <dbl>      <dbl>
 1         1 2018  0.432    10762.      0.786
 2         1 2019  0.948    11340.      0.909
 3         1 2020  0.122    12837.      0.944
 4         2 2018  0.422    11515.      0.950
 5         2 2019  0.0639   12968.      0.828
 6         2 2020  0.611    13645.      0.901
 7         3 2018  0.489    11281.      0.784
 8         3 2019  0.00337  12250.      0.753
 9         3 2020  0.711    12898.      0.803
10         4 2018  0.0596   10526.      0.842

Answer 2

Using pure R:使用纯 R：

tidy_df <- reshape(untidy_df, direction="long", idvar="studentID", varying=2:10, sep="")
head(tidy_df)

       studentID time      score  payment attendance
1.2018         1 2018 0.86743970 10995.45  0.9473540
2.2018         2 2018 0.53204701 11152.74  0.8167776
3.2018         3 2018 0.90072918 10631.06  0.9335316
4.2018         4 2018 0.89154492 11889.23  0.9098399
5.2018         5 2018 0.06320442 10973.20  0.8118909
6.2018         6 2018 0.67519166 11751.67  0.8328860

If you want "year" instead of the default "time", add timevar="year"如果您想要“年”而不是默认的“时间”，请添加timevar="year"

Answer 3

We could try:我们可以尝试：

library(dplyr)
library(tidyr)

untidy_df %>% 
  pivot_longer(cols = -studentID) %>% 
  separate(col = name, sep = "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)", into = c("measure", "year")) %>% 
  pivot_wider(names_from = measure, values_from = value )

Which returns:哪个返回：

 studentID year score payment attendance <int> <chr> <dbl> <dbl> <dbl> 1 1 2018 0.807 10179. 0.974 2 1 2019 0.599 11601. 0.785 3 1 2020 0.515 12347. 0.760 4 2 2018 0.474 11154. 0.983 5 2 2019 0.409 11682. 0.864 6 2 2020 0.688 13756. 0.812 7 3 2018 0.509 11746. 0.870 8 3 2019 0.867 12851. 0.801 9 3 2020 0.878 12710. 0.955 10 4 2018 0.621 11165. 0.975

在 R 中多次收集以创建整洁的数据集

问题描述

3 个解决方案

解决方案1
4 2020-06-11 15:01:30

解决方案2
2 2020-06-11 15:06:02

解决方案3
1 2020-06-11 15:00:59

在 R 中多次收集以创建整洁的数据集

问题描述

3 个解决方案

解决方案1 4 2020-06-11 15:01:30

解决方案2 2 2020-06-11 15:06:02

解决方案3 1 2020-06-11 15:00:59

解决方案1
4 2020-06-11 15:01:30

解决方案2
2 2020-06-11 15:06:02

解决方案3
1 2020-06-11 15:00:59