Perl (or R, or SQL): Count how often string appears across columns
I have a text file that looks like this:
gene1   gene2   gene3
a       d       c
b       e       d
c       f       g
d       g
        h
        i
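(Each column is a human gene, and each contains a variable number of proteins (strings, shown here as letters) that can bind to that gene.)
What I want to do is count how many columns each string appears in, and output that number along with all the matching column headers, like this: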
a 1 gene1
b 1 gene1
c 2 gene1 gene3
d 3 gene1 gene2 gene3
e 1 gene2
f 1 gene2
g 2 gene2 gene3
h 1 gene2
i 1 gene2
I've been trying to figure out how to do this in Perl and R, but so far without success. Thanks for your help.
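This solution feels like a bit of a hack, but it gives the desired output. It relies on using the plyr and reshape packages together, though I'm sure you can find base R alternatives. The trick is that the melt function lets us flatten the data into long format, which makes the manipulation simple(ish) from that point on.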
library(reshape)
library(plyr)
#Recreate your data
dat <- data.frame(gene1 = c(letters[1:4], NA, NA),
gene2 = letters[4:9],
gene3 = c("c", "d", "g", NA, NA, NA)
)
#Melt the data. You'll need to update this if you have more columns
dat.m <- melt(dat, measure.vars = 1:3)
#Tabulate counts
counts <- as.data.frame(table(dat.m$value))
#I'm not sure what to call this column since it's a smooshing of column names
otherColumn <- ddply(dat.m, "value", function(x) paste(x$variable, collapse = " "))
#Merge the two together. You could fix the column names above, or just deal with it here
merge(counts, otherColumn, by.x = "Var1", by.y = "value")
Which gives:
> merge(counts, otherColumn, by.x = "Var1", by.y = "value")
Var1 Freq V1
1 a 1 gene1
2 b 1 gene1
3 c 2 gene1 gene3
4 d 3 gene1 gene2 gene3
....
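In Perl, assuming the proteins within each column contain no duplicates that need to be removed. (If they do, a hash of hashes should be used instead.)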
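use strict;
use warnings;

# Map the starting character offset of each header field to its gene name.
# This assumes fixed-width, space-aligned columns (not tab-delimited input).
my $header = <>;
my %column_genes;
while ($header =~ /(\S+)/g) {
    $column_genes{$-[1]} = $1;
}

# Collect, for each protein, the genes whose column offset it starts at.
my %proteins;
while (my $line = <>) {
    while ($line =~ /(\S+)/g) {
        if (exists $column_genes{$-[1]}) {
            push @{ $proteins{$1} }, $column_genes{$-[1]};
        }
        else {
            warn "line $. column $-[1] unexpected protein $1 ignored\n";
        }
    }
}

# Print the protein, its column count, and the sorted gene names.
for my $protein (sort keys %proteins) {
    print join("\t",
        $protein,
        scalar @{ $proteins{$protein} },
        join(' ', sort @{ $proteins{$protein} } )
    ), "\n";
}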
Reads from stdin, writes to stdout.
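The duplicate-handling variant mentioned above (a hash of hashes) could look roughly like the following in Python. This is a sketch for comparison, not a port of the Perl script: it assumes tab-delimited input where missing cells are empty fields, and the function name proteins_to_genes and the sample list are made up for illustration. A dict of sets plays the role of the hash of hashes, so a protein repeated within one column is still counted once.

```python
from collections import defaultdict


def proteins_to_genes(lines, sep="\t"):
    """Map each protein to the sorted list of gene columns it appears in."""
    rows = [line.rstrip("\n").split(sep) for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    seen = defaultdict(set)  # protein -> set of gene names (deduplicates)
    for row in body:
        for gene, protein in zip(header, row):
            protein = protein.strip()
            if protein:  # skip the empty cells of shorter columns
                seen[protein].add(gene)
    return {protein: sorted(genes) for protein, genes in seen.items()}


# Assumed tab-delimited rendering of the question's sample data.
sample = [
    "gene1\tgene2\tgene3\n",
    "a\td\tc\n",
    "b\te\td\n",
    "c\tf\tg\n",
    "d\tg\t\n",
    "\th\t\n",
    "\ti\t\n",
]

result = proteins_to_genes(sample)
for protein in sorted(result):
    genes = result[protein]
    print(protein, len(genes), " ".join(genes), sep="\t")
```

Splitting on the delimiter rather than on whitespace keeps each value tied to its column even when a column runs out of entries early.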
A one-liner (or rather, a three-liner):
ddply(na.omit(melt(dat, m = 1:3)), .(value), summarize,
len = length(variable),
var = paste(variable, collapse = " "))
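If there aren't too many columns, you can do something like this in SQL. You basically flatten the data into a derived two-column protein/gene table and then summarize it as needed.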
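-- Assumes the wide data is stored in a table named gene_table (hypothetical name).
with cte as (
    select gene1 as protein, 'gene1' as gene from gene_table where gene1 is not null
    union select gene2 as protein, 'gene2' as gene from gene_table where gene2 is not null
    union select gene3 as protein, 'gene3' as gene from gene_table where gene3 is not null
)
select protein, count(*) as cnt, group_concat(gene) as gene
from cte
group by protein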
In MySQL, like this:
select protein, count(*), group_concat(gene order by gene separator ' ') from gene_protein group by protein;
Assuming the data looks like this:
create table gene_protein (gene varchar(255) not null, protein varchar(255) not null);
insert into gene_protein values ('gene1','a'),('gene1','b'),('gene1','c'),('gene1','d');
insert into gene_protein values ('gene2','d'),('gene2','e'),('gene2','f'),('gene2','g'),('gene2','h'),('gene2','i');
insert into gene_protein values ('gene3','c'),('gene3','d'),('gene3','g');
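To try the long-table query without a MySQL server, one option is Python's built-in sqlite3 module. The sketch below is an assumption-laden stand-in rather than the MySQL statement itself: SQLite writes the separator as a second argument to group_concat and does not promise an order inside the aggregate, so the gene names are sorted in Python instead. The table and column names follow the create table statement above; the variable names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table gene_protein ("
    "gene varchar(255) not null, protein varchar(255) not null)"
)

# Same rows as the insert statements above.
rows = (
    [("gene1", p) for p in "abcd"]
    + [("gene2", p) for p in "defghi"]
    + [("gene3", p) for p in "cdg"]
)
conn.executemany("insert into gene_protein values (?, ?)", rows)

query = """
    select protein, count(*) as cnt, group_concat(gene, ' ') as genes
    from gene_protein
    group by protein
    order by protein
"""

summary = {}
for protein, cnt, genes in conn.execute(query):
    # Sort here because SQLite leaves the order within group_concat unspecified.
    summary[protein] = (cnt, sorted(genes.split(" ")))
    print(protein, cnt, " ".join(summary[protein][1]), sep="\t")

conn.close()
```

Because the table stores one row per gene/protein pair, count(*) equals the number of columns a protein appears in only if no pair is inserted twice.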