R - Wide to long format using combinations

Question

Assume I have the following data frame df.

df <- data.frame(ID= c("A", "B", "C"),
             Var1 = c(234, 12, 345),
             Var2 = c(4, 555, 325),
             Var3 = c("45|221|2", "982", NA))

> df
  ID Var1 Var2     Var3
1  A  234    4 45|221|2
2  B   12  555      982
3  C  345  325     <NA>

I would like to create a data.frame in which each of Var1 and Var2 is combined with every element of Var3, by ID.

The outcome I am looking for should look like the following:

> outcome
  ID VarA VarB
1  A  234   45
2  A  234  221
3  A  234    2
4  A    4   45
5  A    4  221
6  A    4    2
7  B   12  982
8  B  555  982

Note that:

The elements in Var3 are separated by a vertical bar (|).
ID == C is not in outcome because Var3 is NA for that ID.

The original data consists of millions of IDs.

Answer 1

We can use tidyverse for a fairly elegant solution. The general idea is to use separate_rows to expand Var3 into rows. Before that, we need to get Var1 and Var2 into a suitable long format so that we don't unnecessarily duplicate values.

library(tidyverse)  # attaches dplyr, tidyr and stringr

df %>%
  # pull Var1 and Var2 into a single pair of key/value columns
  gather(variable, value, -ID, -Var3) %>%
  # split Var3 into one row per value
  separate_rows(Var3, sep = "\\|") %>%
  # drop the rows where Var3 is NA
  drop_na(Var3) %>%
  select(ID, VarA = value, VarB = Var3) %>%
  arrange(ID)

  ID VarA VarB
1  A  234   45
2  A  234  221
3  A  234    2
4  A    4   45
5  A    4  221
6  A    4    2
7  B   12  982
8  B  555  982
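
As a variant of the same idea (a sketch, not part of the original answer): gather() is superseded in tidyr 1.0.0 and later by pivot_longer(), and separate_rows() takes convert = TRUE so that VarB comes back as a number instead of a string. The sketch assumes dplyr and tidyr 1.0.0 or later are installed. It filters out the NA rows first so they are never pivoted or split. The column name "variable" is only a placeholder and is dropped at the end.

```r
library(dplyr)
library(tidyr)

df <- data.frame(ID = c("A", "B", "C"),
                 Var1 = c(234, 12, 345),
                 Var2 = c(4, 555, 325),
                 Var3 = c("45|221|2", "982", NA),
                 stringsAsFactors = FALSE)

outcome <- df %>%
  # drop IDs with no Var3 before doing any reshaping
  filter(!is.na(Var3)) %>%
  # stack Var1 and Var2 into one value column, VarA
  pivot_longer(c(Var1, Var2), names_to = "variable", values_to = "VarA") %>%
  # one row per "|"-separated element; convert = TRUE turns "45" into 45
  separate_rows(Var3, sep = "\\|", convert = TRUE) %>%
  select(ID, VarA, VarB = Var3)

outcome
```

Because pivot_longer() keeps the rows of each ID together, no arrange() step is needed to get the rows in the order shown in the question.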

Answer 2

With tidyverse and splitstackshape you can do:

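library(tidyverse)
library(splitstackshape)  # provides cSplit()

df %>%
 filter(!is.na(Var3)) %>%
 select(-Var3) %>%
 gather(var, VarA, -ID) %>%  # long table of Var1/Var2 values
 select(-var) %>%
 full_join(df %>%
            filter(!is.na(Var3)) %>%
            cSplit("Var3", sep = "|") %>%  # one column per element of Var3
            select(-Var1, -Var2) %>%
            gather(var, VarB, -ID, na.rm = TRUE) %>%  # long table of Var3 elements
            select(-var), by = "ID") %>%
 arrange(ID, VarA, VarB)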

  ID VarA VarB
1  A    4    2
2  A    4   45
3  A    4  221
4  A  234    2
5  A  234   45
6  A  234  221
7  B   12  982
8  B  555  982

First, it filters out the rows where Var3 is NA. Second, it transforms the data from wide to long format, leaving out Var3. Finally, it performs a full join with a second copy of df in which the rows with NA in Var3 were filtered out, Var3 was split on "|", and the result was transformed from wide to long format, leaving out Var1 and Var2.
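
The three steps described above can also be sketched in base R, without tidyverse or splitstackshape. This is an illustration of the join logic, not the answer's own code. The names keep, long_a, long_b and joined are made up for the example. It builds one long table per side and lets merge() produce every VarA/VarB combination within each ID.

```r
df <- data.frame(ID = c("A", "B", "C"),
                 Var1 = c(234, 12, 345),
                 Var2 = c(4, 555, 325),
                 Var3 = c("45|221|2", "982", NA),
                 stringsAsFactors = FALSE)

# Step 1: drop the rows where Var3 is NA
keep <- df[!is.na(df$Var3), ]

# Step 2: long table of the Var1 and Var2 values, one row per value
long_a <- data.frame(ID = rep(keep$ID, times = 2),
                     VarA = c(keep$Var1, keep$Var2),
                     stringsAsFactors = FALSE)

# Step 3: long table of the Var3 elements, split on a literal "|"
parts <- strsplit(keep$Var3, "|", fixed = TRUE)
long_b <- data.frame(ID = rep(keep$ID, times = lengths(parts)),
                     VarB = as.numeric(unlist(parts)),
                     stringsAsFactors = FALSE)

# Joining on ID pairs every VarA with every VarB of the same ID
joined <- merge(long_a, long_b, by = "ID")
joined <- joined[order(joined$ID, joined$VarA, joined$VarB), ]
rownames(joined) <- NULL

joined
```

Since every ID that survives the filter appears in both long tables, an inner join (the default for merge()) gives the same rows here as the full join in the answer.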


Answer 1
Score 2, accepted, 2019-01-22 22:11:50

Answer 2
Score 1, 2019-01-22 22:25:13
