Multiple gathering in R to create tidy dataset
I have a complicated untidy dataset; a dummy version of it can be reproduced with the code below.
studentID <- 1:250
score2018 <- runif(250)
score2019 <- runif(250)
score2020 <- runif(250)
payment2018 <- runif(250, min=10000, max=12000)
payment2019 <- runif(250, min=11000, max=13000)
payment2020 <- runif(250, min=12000, max=14000)
attendance2018 <- runif(250, min=0.75, max=1)
attendance2019 <- runif(250, min=0.75, max=1)
attendance2020 <- runif(250, min=0.75, max=1)
untidy_df <- data.frame(studentID, score2018, score2019, score2020, payment2018, payment2019, payment2020, attendance2018, attendance2019, attendance2020)
I would like to gather this data frame so that it has only 5 columns: studentID, year, score, payment, attendance. I know how to gather at a basic level, but I have 3 sets of columns to gather here, and I can't see how to do this in one go.
Thanks in advance!提前致谢!
With tidyr you can use pivot_longer:
library(tidyr)
untidy_df %>%
  pivot_longer(cols = -studentID, names_to = c(".value", "year"), names_pattern = "(\\w+)(\\d{4})")
Output
# A tibble: 750 x 5
   studentID year    score payment attendance
       <int> <chr>   <dbl>   <dbl>      <dbl>
 1         1 2018  0.432    10762.      0.786
 2         1 2019  0.948    11340.      0.909
 3         1 2020  0.122    12837.      0.944
 4         2 2018  0.422    11515.      0.950
 5         2 2019  0.0639   12968.      0.828
 6         2 2020  0.611    13645.      0.901
 7         3 2018  0.489    11281.      0.784
 8         3 2019  0.00337  12250.      0.753
 9         3 2020  0.711    12898.      0.803
10         4 2018  0.0596   10526.      0.842
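Note that year comes back as character (<chr>) in the output above. The following is a minimal sketch, assuming tidyr 1.1.0 or later (where pivot_longer accepts names_transform), that converts year to integer during the pivot. The small data frame small_df, the seed, and the stricter letters-only pattern are illustrative choices, not part of the original answer.

```r
library(tidyr)

set.seed(1)  # illustrative seed so the sketch is reproducible

# Hypothetical reduced version of untidy_df: 3 students, 2 measures, 2 years
small_df <- data.frame(
  studentID   = 1:3,
  score2018   = runif(3),
  score2019   = runif(3),
  payment2018 = runif(3, min = 10000, max = 12000),
  payment2019 = runif(3, min = 11000, max = 13000)
)

tidy_small <- pivot_longer(
  small_df,
  cols = -studentID,
  # ".value" keeps the first capture group (score, payment) as column names
  names_to = c(".value", "year"),
  # letters-only first group, so the 4-digit year cannot leak into the name
  names_pattern = "([A-Za-z]+)(\\d{4})",
  # convert the year column from character to integer
  names_transform = list(year = as.integer)
)

str(tidy_small)
```

If an older tidyr is installed, the same effect can be had by converting the column afterwards with as.integer().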
Using base R:
tidy_df <- reshape(untidy_df, direction="long", idvar="studentID", varying=2:10, sep="")
head(tidy_df)
       studentID time      score  payment attendance
1.2018         1 2018 0.86743970 10995.45  0.9473540
2.2018         2 2018 0.53204701 11152.74  0.8167776
3.2018         3 2018 0.90072918 10631.06  0.9335316
4.2018         4 2018 0.89154492 11889.23  0.9098399
5.2018         5 2018 0.06320442 10973.20  0.8118909
6.2018         6 2018 0.67519166 11751.67  0.8328860
If you want "year" instead of the default "time", add timevar="year".
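As a rough illustration of the timevar option, here is a sketch on a reduced data frame. The names small_df and long_df, the seed, and the final sorting and row-name reset are assumptions added for readability, not part of the original answer.

```r
set.seed(1)  # illustrative seed so the sketch is reproducible

# Hypothetical reduced version of untidy_df: 3 students, 2 measures, 2 years
small_df <- data.frame(
  studentID      = 1:3,
  score2018      = runif(3),
  score2019      = runif(3),
  attendance2018 = runif(3, min = 0.75, max = 1),
  attendance2019 = runif(3, min = 0.75, max = 1)
)

# sep = "" lets reshape() split names such as "score2018" into "score" and 2018
long_df <- reshape(small_df, direction = "long", idvar = "studentID",
                   varying = 2:5, sep = "", timevar = "year")

# Optional tidy-up: order by student and year, drop the "1.2018"-style row names
long_df <- long_df[order(long_df$studentID, long_df$year), ]
rownames(long_df) <- NULL

long_df
```

The result has one row per student and year, with the columns studentID, year, score and attendance.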
We could try:
library(dplyr)
library(tidyr)
untidy_df %>%
  pivot_longer(cols = -studentID) %>%
  separate(col = name, sep = "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)", into = c("measure", "year")) %>%
  pivot_wider(names_from = measure, values_from = value)
Which returns:
   studentID year  score payment attendance
       <int> <chr> <dbl>   <dbl>      <dbl>
 1         1 2018  0.807  10179.      0.974
 2         1 2019  0.599  11601.      0.785
 3         1 2020  0.515  12347.      0.760
 4         2 2018  0.474  11154.      0.983
 5         2 2019  0.409  11682.      0.864
 6         2 2020  0.688  13756.      0.812
 7         3 2018  0.509  11746.      0.870
 8         3 2019  0.867  12851.      0.801
 9         3 2020  0.878  12710.      0.955
10         4 2018  0.621  11165.      0.975