
R - from wide to long format using combinations

Assume I have the following data frame df.

df <- data.frame(ID= c("A", "B", "C"),
             Var1 = c(234, 12, 345),
             Var2 = c(4, 555, 325),
             Var3 = c("45|221|2", "982", NA))

> df
  ID Var1 Var2     Var3
1  A  234    4 45|221|2
2  B   12  555      982
3  C  345  325     <NA>

I would like to create a data.frame in which each value of Var1 and Var2 is combined with each element of Var3, by ID.

The outcome I am looking for should look like the following:

> outcome
  ID VarA VarB
1  A  234   45
2  A  234  221
3  A  234    2
4  A    4   45
5  A    4  221
6  A    4    2
7  B   12  982
8  B  555  982

Note that:

  • the elements in Var3 are separated by a vertical bar |
  • ID == C is not in outcome because Var3 is NA for that ID.

The original data consists of millions of IDs.

We can use tidyverse for a fairly elegant solution. The general idea is to use separate_rows to expand Var3 into rows. Before that, we need to get Var1 and Var2 into a long format, so that each value sits in its own row and we don't unnecessarily duplicate values.

library(tidyverse)

df %>%
  # pull Var1 and Var2 into a single pair of key/value columns
  gather(variable, value, -ID, -Var3) %>%
  # split Var3 into one row per value; convert = TRUE makes the pieces numeric
  separate_rows(Var3, sep = "\\|", convert = TRUE) %>%
  # drop the rows where Var3 is NA
  drop_na(Var3) %>%
  select(ID, VarA = value, VarB = Var3) %>%
  arrange(ID)

  ID VarA VarB
1  A  234   45
2  A  234  221
3  A  234    2
4  A    4   45
5  A    4  221
6  A    4    2
7  B   12  982
8  B  555  982
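
In current tidyr releases gather is superseded by pivot_longer, so the same idea could be written as below. This is only a sketch under that assumption, not the original answer's code; the column name "variable" and the result name `outcome` are chosen here for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(ID = c("A", "B", "C"),
                 Var1 = c(234, 12, 345),
                 Var2 = c(4, 555, 325),
                 Var3 = c("45|221|2", "982", NA),
                 stringsAsFactors = FALSE)

outcome <- df %>%
  # stack Var1 and Var2 into one value column (VarA), one row per value
  pivot_longer(c(Var1, Var2), names_to = "variable", values_to = "VarA") %>%
  # split Var3 on the literal bar, one row per piece, converted to numbers
  separate_rows(Var3, sep = "\\|", convert = TRUE) %>%
  # ID C has no Var3 values, so its rows are dropped here
  drop_na(Var3) %>%
  select(ID, VarA, VarB = Var3)

outcome
```

Because pivot_longer stacks the values row by row, the rows for each ID stay together, so the arrange step may not be needed here.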

With tidyverse and splitstackshape you can do:

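library(tidyverse)
library(splitstackshape)

# left side: Var1 and Var2 in long format (VarA)
df %>%
 filter(!is.na(Var3)) %>%
 select(-Var3) %>%
 gather(var, VarA, -ID) %>%
 select(-var) %>%
 # right side: the pieces of Var3 in long format (VarB)
 full_join(df %>%
            filter(!is.na(Var3)) %>%
            cSplit("Var3", sep = "|") %>%
            select(-Var1, -Var2) %>%
            gather(var, VarB, -ID, na.rm = TRUE) %>%
            select(-var), by = c("ID" = "ID")) %>%
 arrange(ID, VarA, VarB)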

  ID VarA VarB
1  A    4    2
2  A    4   45
3  A    4  221
4  A  234    2
5  A  234   45
6  A  234  221
7  B   12  982
8  B  555  982

First, it filters out the rows where Var3 is NA. Second, it reshapes Var1 and Var2 from wide to long format, leaving Var3 out. Finally, it performs a full join with a second copy of df in which the rows with NA in Var3 were filtered out, Var3 was split on "|", and the resulting columns were reshaped from wide to long format, without Var1 and Var2.
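
The join logic described above can also be illustrated in base R, which may help when checking what each step produces. This is a rough sketch, not the answer's code: it swaps in strsplit and merge for cSplit and full_join, and the names `kept`, `long_a`, `long_b`, `pieces` and `joined` are made up for the illustration.

```r
df <- data.frame(ID = c("A", "B", "C"),
                 Var1 = c(234, 12, 345),
                 Var2 = c(4, 555, 325),
                 Var3 = c("45|221|2", "982", NA),
                 stringsAsFactors = FALSE)

# Step 1: drop the rows where Var3 is NA
kept <- df[!is.na(df$Var3), ]

# Step 2: long table of the Var1/Var2 values, one row per (ID, value)
long_a <- data.frame(ID = rep(kept$ID, times = 2),
                     VarA = c(kept$Var1, kept$Var2))

# Step 3: long table of the Var3 pieces, one row per (ID, piece)
pieces <- strsplit(kept$Var3, "|", fixed = TRUE)
long_b <- data.frame(ID = rep(kept$ID, times = lengths(pieces)),
                     VarB = as.numeric(unlist(pieces)))

# Step 4: joining on ID pairs every VarA with every VarB of the same ID
joined <- merge(long_a, long_b, by = "ID")
joined <- joined[order(joined$ID, joined$VarA, joined$VarB), ]
rownames(joined) <- NULL

joined
```

Since each ID appears several times on both sides, the join yields every VarA/VarB pairing within an ID, which is exactly the set of combinations asked for.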
