
Melt a table (data.frame) based on values of comma-separated character vector column

I'm doing an experiment where I have "regions" with some associated statistic (actually many other statistics and descriptive columns), and a comma-separated list of genes that lie in those regions. The number of genes in this list varies from region to region, and the list may be empty (NA).

How can I "melt" table a:

  region_id  statistic      genelist
          1        2.5       A, B, C
          2        0.5    B, C, D, E
          3        3.2          <NA>
          4        0.1          E, F

to create another table with a separate row for each gene in the list of genes? I.e.:

   region_id statistic gene
           1       2.5    A
           1       2.5    B
           1       2.5    C
           2       0.5    B
           2       0.5    C
           2       0.5    D
           2       0.5    E
           3       3.2 <NA>
           4       0.1    E
           4       0.1    F

I'm guessing there's a way to do this with R/plyr, but I'm not sure how. Thanks in advance.

Edit:

Using R you can recreate these toy data frames (a is the input, b is the desired output) with this code:

a <- structure(list(region_id = 1:4, statistic = c(2.5, 0.5, 3.2, 
0.1), genelist = structure(c(1L, 2L, NA, 3L), .Label = c("A, B, C", 
"B, C, D, E", "E, F"), class = "factor")), .Names = c("region_id", 
"statistic", "genelist"), class = "data.frame", row.names = c(NA, 
-4L))

b <- structure(list(region_id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 
4L, 4L), statistic = c(2.5, 2.5, 2.5, 0.5, 0.5, 0.5, 0.5, 3.2, 
0.1, 0.1), gene = structure(c(1L, 2L, 3L, 2L, 3L, 4L, 5L, NA, 
5L, 6L), .Label = c("A", "B", "C", "D", "E", "F"), class = "factor")), .Names = c("region_id", 
"statistic", "gene"), class = "data.frame", row.names = c(NA, 
-10L))

A data.table solution for time, memory and coding efficiency:

library(data.table)
DT <- data.table(a)
DT[, list(statistic, 
          gene = unlist(strsplit(as.character(genelist), ', ' ))),
   by = list(region_id)]

Or you could use the nice formatting of list columns available in data.table version >= 1.8.2:

DTL <- DT[, list(statistic, 
         gene = strsplit(as.character(genelist), ', ' )),
    by = list(region_id)]

DTL
##    region_id statistic    gene
## 1:         1       2.5   A,B,C
## 2:         2       0.5 B,C,D,E
## 3:         3       3.2      NA
## 4:         4       0.1     E,F

In which case gene is a list column, where each entry holds the character vector of genes for that region:

DTL[region_id == 1,unlist(gene)]
## [1] "A" "B" "C"
DTL[region_id == 2,unlist(gene)]
## [1] "B" "C" "D" "E"
# or if the following is of interest
DTL[statistic < 2,unlist(gene)]
## [1] "B" "C" "D" "E" "E" "F"

etc.
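The same list-column idea can be sketched in base R, without data.table. This is only an illustration under the assumption that the toy table from the question is rebuilt with genelist as a plain character column; the column name gene is chosen to match the output above.

```r
# Rebuild the toy input from the question (genelist as character, not factor)
a <- data.frame(region_id = 1:4,
                statistic = c(2.5, 0.5, 3.2, 0.1),
                genelist  = c("A, B, C", "B, C, D, E", NA, "E, F"),
                stringsAsFactors = FALSE)

# Store the split genes as a list column: one character vector per region
a$gene <- strsplit(a$genelist, ",\\s*")

# Pull out the genes for the regions that satisfy a condition
low_stat_genes <- unlist(a$gene[a$statistic < 2])
low_stat_genes
```

Keeping the genes in a list column leaves one row per region, which is convenient when the other columns should not be duplicated.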

Simply split the fields, then split the genes and print one line per gene. The fields are assumed to be tab-separated. To try this in a script, replace <DATA> with <> and pass the input file as an argument to the Perl script, e.g. perl script.pl input.txt.

use strict;
use warnings;

while (<DATA>) {
    chomp;                                   # remove newline
    my ($reg, $stat, $gene) = split /\t/;    # split fields
    my @genes = split /,\s*/, $gene;         # split genes
    for (@genes) {
        local $\ = "\n";                 # adds newline to print
        print join "\t", $reg, $stat, $_;
    }
}

__DATA__
region_id   statistic   genelist
1   2.5 A, B, C
2   0.5 B, C, D, E
3   3.2 <NA>
4   0.1 E, F

Output:

region_id       statistic       genelist
1       2.5     A
1       2.5     B
1       2.5     C
2       0.5     B
2       0.5     C
2       0.5     D
2       0.5     E
3       3.2     <NA>
4       0.1     E
4       0.1     F

There are a few ways to do it. This way works, although there may be better ways...

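library(plyr)    # for join and ddply
library(stringr) # for str_split
# split on a comma plus any following whitespace, so genes carry no leading space
join(subset(a, select=c("region_id", "statistic")), 
     ddply(a, .(region_id), summarise, gene=str_split(genelist, ",\\s*")[[1]]))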

Needs plyr and stringr loaded.

Oh, here's a better way:

ddply(a, .(region_id), 
      function(x) data.frame(gene=str_split(x$genelist, ",\\s*")[[1]], 
                             statistic=x$statistic))

Here is a way to do it without any libraries:

data<-cbind(region_id=1:4, statistic=c(2.5, 0.5, 3.2, 0.1), genelist=c("A, B, C", "B, C, D, E", NA, "E, F"))

do.call(rbind, 
        apply(data, 1, 
              function(r) do.call(expand.grid, 
                                  c(unlist(r[-3]), 
                                    strsplit(r[3], ", ")))))

Output:

      region_id statistic genelist
1          1       2.5        A
2          1       2.5        B
3          1       2.5        C
4          2       0.5        B
5          2       0.5        C
6          2       0.5        D
7          2       0.5        E
8          3       3.2     <NA>
9          4       0.1        E
10         4       0.1        F
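Another library-free variant, offered only as a sketch: split every genelist entry once, then repeat each row's other columns as many times as that row has genes. It assumes R >= 3.2.0 for lengths() and rebuilds the toy table from the question as a data frame; the name n_genes is introduced here for illustration.

```r
# Toy input from the question, as a data frame with a character genelist
a <- data.frame(region_id = 1:4,
                statistic = c(2.5, 0.5, 3.2, 0.1),
                genelist  = c("A, B, C", "B, C, D, E", NA, "E, F"),
                stringsAsFactors = FALSE)

# Split each entry on a comma plus optional whitespace;
# an NA entry stays a single NA, so that region keeps one row
genes   <- strsplit(as.character(a$genelist), ",\\s*")
n_genes <- lengths(genes)   # number of genes per region

# Repeat the per-region columns once per gene and flatten the gene list
b <- data.frame(region_id = rep(a$region_id, n_genes),
                statistic = rep(a$statistic, n_genes),
                gene      = unlist(genes),
                stringsAsFactors = FALSE)
b
```

Because rep() is applied column by column, further descriptive columns can be carried along in the same way, e.g. by indexing with a[rep(seq_len(nrow(a)), n_genes), ].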

Here is another one using plyr:

ddply(a, .(region_id), transform, gene = str_split(genelist, ',\\s*')[[1]])
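A note on the separator pattern, since the lists in the question are written as "A, B, C" with a space after each comma. Splitting on a bare comma keeps that space at the start of every gene but the first, while a pattern that also consumes trailing whitespace does not. A small base R check (the variable names are illustrative):

```r
x <- "B, C, D, E"

# Bare comma: the space after each comma stays attached to the next gene
with_spaces <- strsplit(x, ",")[[1]]

# Comma plus optional whitespace: clean gene names
clean <- strsplit(x, ",\\s*")[[1]]

# Trimming after a bare-comma split gives the same result
trimmed <- trimws(with_spaces)

print(with_spaces)
print(clean)
```

Leading spaces are easy to miss in printed output but make " B" and "B" distinct values, which matters for any later grouping or joining on the gene column.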

A Perl solution:

#!/usr/bin/perl
<>;
print "region_id\tstatistic\tgene\n";
while(<>) {
  chomp;
  my ($reg, $stat, $genes) = split /\s+/, $_, 3;
  foreach my $gene (split /,\s*/, $genes) {
     print "$reg\t$stat\t$gene\n";
  }
}

Just pipe the original file through this script into the output file.

Currently the output values are tab-separated and not right-aligned, but you can fix that if it is really needed.
