
Need to split a column containing varying numbers of doubly concatenated data of variable names and observations

I have a column "sample_values" containing varying numbers of doubly concatenated entries, delimited with both "," and ":" characters. I need to turn the names that precede each ":" into new variables (columns) and the values that follow each ":" into the observations of those new variables, with "," separating one name:value pair from the next. A small subset of the problematic data.frame is shown here:

```{r}
> CDR3 <- c("CASSKGTGGPYEQYF", "CASSSDTDPSYGYTF", "CASSFGTGKNTEAFF", "CASSPRPRYYEQYF")
> sample_values <- c("sample_a:36,sample_b:24,sample_c:56", "sample_a:47", "sample_a:73,sample_b:12", "sample_c:76,sample_d:89")
> df <- data.frame(CDR3, sample_values)
> df
             CDR3                       sample_values
1 CASSKGTGGPYEQYF sample_a:36,sample_b:24,sample_c:56
2 CASSSDTDPSYGYTF                         sample_a:47
3 CASSFGTGKNTEAFF             sample_a:73,sample_b:12
4  CASSPRPRYYEQYF             sample_c:76,sample_d:89
```  

I would like to end up with the following result:

```{r}
             CDR3 sample_a sample_b sample_c sample_d
1 CASSKGTGGPYEQYF       36       24       56        0
2 CASSSDTDPSYGYTF       47        0        0        0
3 CASSFGTGKNTEAFF       73       12        0        0
4  CASSPRPRYYEQYF        0        0       76       89
```  

Note that the absence of an observation should be interpreted as zero.

I've attempted this using various combinations of separate() and spread() from the tidyr package, as well as cSplit() from the splitstackshape package. The tidyr attempts failed because the number of observations to separate differs from row to row, and the splitstackshape attempt failed due to insufficient memory (the unabridged data file is 485 MB in size).

Using the tidyverse, we can first split sample_values so that each name:value pair gets its own row (separate_rows()), then separate() each pair into a column-name column and a value column, and finally spread() the result to wide format, filling missing values with 0.

```{r}
library(tidyverse)

df %>%
  # one row per "name:value" pair
  separate_rows(sample_values, sep = ",") %>%
  # split each pair into the future column name and its value
  separate(sample_values, into = c("col", "values"), sep = ":") %>%
  # reshape to wide format; combinations that never occur are filled with 0
  spread(col, values, fill = 0)

# CDR3            sample_a sample_b sample_c sample_d
#  <fct>           <chr>    <chr>    <chr>    <chr>   
#1 CASSFGTGKNTEAFF 73       12       0        0       
#2 CASSKGTGGPYEQYF 36       24       56       0       
#3 CASSPRPRYYEQYF  0        0        76       89      
#4 CASSSDTDPSYGYTF 47       0        0        0       
```
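
The output above shows the new columns as character (`<chr>`), whereas the desired result holds numbers. As a hedged variant of the same idea, the sketch below asks `separate()` to convert the values with `convert = TRUE` and uses `pivot_wider()`, which the tidyr documentation recommends over the superseded `spread()`. The column names `sample` and `count` and the object name `wide` are illustrative choices, not part of the original answer, and the sketch assumes a tidyr version that provides `pivot_wider()` (1.0.0 or later).

```{r}
library(tidyr)  # provides separate_rows(), separate(), pivot_wider() and re-exports %>%

# Same example data as in the question (strings kept as character, not factor)
df <- data.frame(
  CDR3 = c("CASSKGTGGPYEQYF", "CASSSDTDPSYGYTF", "CASSFGTGKNTEAFF", "CASSPRPRYYEQYF"),
  sample_values = c("sample_a:36,sample_b:24,sample_c:56", "sample_a:47",
                    "sample_a:73,sample_b:12", "sample_c:76,sample_d:89"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  # one row per "name:value" pair
  separate_rows(sample_values, sep = ",") %>%
  # split each pair; convert = TRUE turns the value strings into numbers
  # ("sample" and "count" are hypothetical names chosen for this sketch)
  separate(sample_values, into = c("sample", "count"), sep = ":", convert = TRUE) %>%
  # one column per sample name; missing combinations are filled with zero
  pivot_wider(names_from = sample, values_from = count, values_fill = 0L)

wide
```

This has not been tried on the full 485 MB file, so whether it avoids the memory problem described in the question remains an open assumption.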
