R - from wide to long format using combinations
Assume I have the following data frame df:
df <- data.frame(ID= c("A", "B", "C"),
Var1 = c(234, 12, 345),
Var2 = c(4, 555, 325),
Var3 = c("45|221|2", "982", NA))
> df
ID Var1 Var2 Var3
1 A 234 4 45|221|2
2 B 12 555 982
3 C 345 325 <NA>
I would like to create a data.frame in which Var1 and Var2 are each combined with the elements in Var3, by ID.
The outcome I am looking for should look like the following:
> outcome
ID VarA VarB
1 A 234 45
2 A 234 221
3 A 234 2
4 A 4 45
5 A 4 221
6 A 4 2
7 B 12 982
8 B 555 982
Note that:
The elements in Var3 are separated by a vertical bar (|).
ID == C is not in outcome because Var3 is NA for that ID.
The original data consists of millions of IDs.
We can use tidyverse for a fairly elegant solution. The general idea is that we can use separate_rows to expand Var3 into rows; we just need to get Var1/Var2 into a suitable long format first so we don't unnecessarily duplicate values.
library(tidyverse)  # attaches dplyr, tidyr and stringr, so no separate library(stringr) is needed

df %>%
  gather(variable, value, -ID, -Var3) %>%  # pull Var1 and Var2 into a single pair of key/value columns
  separate_rows(Var3, sep = "\\|") %>%     # split Var3 into one row per value
  drop_na(Var3) %>%                        # drop the NA rows
  select(ID, VarA = value, VarB = Var3) %>%
  arrange(ID)
ID VarA VarB
1 A 234 45
2 A 234 221
3 A 234 2
4 A 4 45
5 A 4 221
6 A 4 2
7 B 12 982
8 B 555 982
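In current tidyr, gather() is superseded by pivot_longer(), so the same idea can be sketched with the newer function. This is only a variant of the answer above, not a different method: it assumes tidyr 1.0.0 or later, drops the NA rows first so they are never expanded, and uses convert = TRUE so that VarB comes out numeric instead of character. The names outcome and variable are illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(ID   = c("A", "B", "C"),
                 Var1 = c(234, 12, 345),
                 Var2 = c(4, 555, 325),
                 Var3 = c("45|221|2", "982", NA),
                 stringsAsFactors = FALSE)

outcome <- df %>%
  drop_na(Var3) %>%                                   # IDs without Var3 values are not wanted
  pivot_longer(c(Var1, Var2),
               names_to = "variable",                 # holds "Var1"/"Var2"; dropped below
               values_to = "VarA") %>%
  separate_rows(Var3, sep = "\\|", convert = TRUE) %>% # one row per element of Var3
  select(ID, VarA, VarB = Var3)

print(outcome)
```

Filtering before reshaping keeps the intermediate table smaller, which may matter with millions of IDs, although that has not been benchmarked here.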
With tidyverse and splitstackshape you can do:
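library(tidyverse)
library(splitstackshape)  # provides cSplit()

df %>%
  filter(!is.na(Var3)) %>%                    # drop IDs with no Var3 values
  select(-Var3) %>%
  gather(var, VarA, -ID) %>%                  # Var1/Var2 into one long VarA column
  select(-var) %>%
  full_join(df %>%
              filter(!is.na(Var3)) %>%
              cSplit("Var3", sep = "|") %>%   # split Var3 into one column per element
              select(-Var1, -Var2) %>%
              gather(var, VarB, -ID, na.rm = TRUE) %>%
              select(-var),
            by = c("ID" = "ID")) %>%
  arrange(ID, VarA, VarB)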
ID VarA VarB
1 A 4 2
2 A 4 45
3 A 4 221
4 A 234 2
5 A 234 45
6 A 234 221
7 B 12 982
8 B 555 982
First, it filters out the rows where Var3 is NA. Second, it transforms the data from wide to long format, without the variable Var3. Finally, it performs a full join with a second copy of df in which the rows with NA in Var3 were filtered out, Var3 was split on "|", and the result was then transformed from wide to long format, without Var1 and Var2.
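The join described above can also be written as two small long tables that are then matched on ID. The following is a rough sketch of that idea, not the answer's exact code: it swaps cSplit() for tidyr's separate_rows() and uses base R's merge() for the join, and the names keep, left, right and outcome are made up for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(ID   = c("A", "B", "C"),
                 Var1 = c(234, 12, 345),
                 Var2 = c(4, 555, 325),
                 Var3 = c("45|221|2", "982", NA),
                 stringsAsFactors = FALSE)

keep <- df %>% filter(!is.na(Var3))        # rows with NA in Var3 are dropped up front

# Long table of Var1/Var2 values: one row per (ID, VarA)
left <- keep %>%
  select(ID, Var1, Var2) %>%
  pivot_longer(-ID, names_to = "variable", values_to = "VarA") %>%
  select(-variable)

# Long table of Var3 elements: one row per (ID, VarB)
right <- keep %>%
  select(ID, VarB = Var3) %>%
  separate_rows(VarB, sep = "\\|", convert = TRUE)

# Every VarA of an ID is paired with every VarB of the same ID
outcome <- merge(left, right, by = "ID") %>%
  arrange(ID, VarA, VarB)

print(outcome)
```

Because both tables contain only IDs that survived the filter, an inner join such as merge() pairs the same rows as the full join in the answer.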