Need to split a column containing varying numbers of doubly concatenated data of variable names and observations
I have a column "sample_values" holding a varying number of doubly concatenated entries, delimited by both "," and ":" characters. I need to turn the items separated by "," into new variables (columns), with the value after each ":" as the observation for that variable. A small subset of the problematic data.frame is shown here:
```{r}
> CDR3 <- c("CASSKGTGGPYEQYF", "CASSSDTDPSYGYTF", "CASSFGTGKNTEAFF", "CASSPRPRYYEQYF")
> sample_values <- c("sample_a:36,sample_b:24,sample_c:56", "sample_a:47", "sample_a:73,sample_b:12", "sample_c:76,sample_d:89")
> df <- data.frame(CDR3, sample_values)
> df
             CDR3                       sample_values
1 CASSKGTGGPYEQYF sample_a:36,sample_b:24,sample_c:56
2 CASSSDTDPSYGYTF                         sample_a:47
3 CASSFGTGKNTEAFF             sample_a:73,sample_b:12
4  CASSPRPRYYEQYF             sample_c:76,sample_d:89
```
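To make the nested structure concrete, here is a minimal base R sketch that decomposes a single cell: split on "," to get the name:value pairs, then split each pair on ":". The names `cell`, `pairs`, `parts` and `counts` are illustrative only and do not come from the original data.

```{r}
# One cell of sample_values, copied from the first row of the example
cell <- "sample_a:36,sample_b:24,sample_c:56"

# Outer delimiter: one element per "name:value" pair
pairs <- strsplit(cell, ",", fixed = TRUE)[[1]]

# Inner delimiter: each pair becomes c(name, value)
parts <- strsplit(pairs, ":", fixed = TRUE)

# Named integer vector: names are the future columns, values the observations
counts <- setNames(
  as.integer(vapply(parts, `[`, character(1), 2)),
  vapply(parts, `[`, character(1), 1)
)
counts
```

The difficulty in the full data is that each row yields a different set of names, so the rows cannot simply be split into a fixed number of columns.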
I would like to end up with the following result:
```{r}
             CDR3 sample_a sample_b sample_c sample_d
1 CASSKGTGGPYEQYF       36       24       56        0
2 CASSSDTDPSYGYTF       47        0        0        0
3 CASSFGTGKNTEAFF       73       12        0        0
4  CASSPRPRYYEQYF        0        0       76       89
```
Note that the absence of an observation should be interpreted as zero.
I've attempted this using various combinations of separate() and spread() from the tidyr package, as well as cSplit() from the splitstackshape package. The tidyr options failed because the number of observations to separate differs from row to row, and the splitstackshape option failed due to insufficient memory (the unabridged data file is 485 MB in size).
Using tidyverse, we can first bring all sample_values into individual rows, then separate the column names and values into individual columns, and finally spread the result to wide format, filling missing values with 0.
```{r}
library(tidyverse)

df %>%
  # one row per "name:value" pair
  separate_rows(sample_values, sep = ",") %>%
  # split each pair into a column name and its value
  separate(sample_values, into = c("col", "values"), sep = ":") %>%
  # one column per name; missing combinations are filled with 0
  spread(col, values, fill = 0)

#             CDR3 sample_a sample_b sample_c sample_d
#            <fct>    <chr>    <chr>    <chr>    <chr>
#1 CASSFGTGKNTEAFF       73       12        0        0
#2 CASSKGTGGPYEQYF       36       24       56        0
#3  CASSPRPRYYEQYF        0        0       76       89
#4 CASSSDTDPSYGYTF       47        0        0        0
```
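Note that the sample columns in this output are character (`<chr>`), not numeric. As a hedged variation rather than part of the original answer, the same three steps can be sketched with `pivot_wider()`, which supersedes `spread()` in current tidyr, plus `convert = TRUE` so the values become integers. This assumes a reasonably recent tidyr (1.1.0 or later, where `values_fill` accepts a scalar as far as I know). The column names `sample` and `count` are illustrative.

```{r}
library(dplyr)
library(tidyr)

# Example data from the question
CDR3 <- c("CASSKGTGGPYEQYF", "CASSSDTDPSYGYTF", "CASSFGTGKNTEAFF", "CASSPRPRYYEQYF")
sample_values <- c("sample_a:36,sample_b:24,sample_c:56", "sample_a:47",
                   "sample_a:73,sample_b:12", "sample_c:76,sample_d:89")
df <- data.frame(CDR3, sample_values, stringsAsFactors = FALSE)

wide <- df %>%
  # one row per "name:value" pair
  separate_rows(sample_values, sep = ",") %>%
  # split each pair; convert = TRUE turns the value strings into integers
  separate(sample_values, into = c("sample", "count"), sep = ":", convert = TRUE) %>%
  # one column per sample name; combinations that never occur become 0
  pivot_wider(names_from = sample, values_from = count, values_fill = 0)

wide
```

Unlike `spread()`, `pivot_wider()` keeps rows in order of first appearance, so the result follows the row order of the input. Whether this fits in memory for the full 485 MB file has not been tested here.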