
How to split a data.table by groups and subset by occurrences in a column?

I have a large dataset, 287046 x 18, that looks like this (only a partial representation):

tdf
         geneSymbol     peaks
16         AK056486 Pol2_only
13         AK310751   no_peak
7          BC036251   no_peak
10         DQ575786   no_peak
4          DQ597235   no_peak
5          DQ599768   no_peak
11         DQ599872   no_peak
12         DQ599872   no_peak
2           FAM138F   no_peak
15           FAM41C   no_peak
34116         GAPDH      both
283034        GAPDH Pol2_only
6      LOC100132062   no_peak
9      LOC100133331   no_peak
14     LOC100288069      both
8            M37726   no_peak
3             OR4F5   no_peak
17           SAMD11      both
18           SAMD11      both
19           SAMD11      both
20           SAMD11      both
21           SAMD11      both
22           SAMD11      both
23           SAMD11      both
24           SAMD11      both
25           SAMD11      both
1            WASH7P Pol2_only

What I want to do is extract (1) the geneSymbols that are either "Pol2_only" or "both", and then (2) just the geneSymbols that are "Pol2_only" but not "both". For example, GAPDH would fulfil condition 1 but not 2.

I've tried plyr with something like this (there is an extra condition there, please ignore):

## grab genes with both peaks 
pol2.peaks <- ddply(filem, .(geneSymbol), function(dfrm) subset(dfrm, peaks == "both" | (peaks == "Pol2_only" & peaks == "CBP20_only")), .parallel=TRUE)

## grab genes pol2 only peaks 
pol2.only.peaks <- ddply(tdf, .(geneSymbol), function(dfrm) subset(dfrm, peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only"), .parallel=TRUE)

But it takes a long time and still returns the wrong answer. For instance, the answer for 2 is:

pol2.only.peaks
  geneSymbol     peaks
1   AK056486 Pol2_only
2      GAPDH Pol2_only
3     WASH7P Pol2_only

As you can see, GAPDH should not be there. My implementation in data.table (which is much faster and thus preferred) also yields the same result:

filem.dt <- as.data.table(tdf)
setkey(filem.dt, "geneSymbol")
test.dt <- filem.dt[ , .SD[ peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only"]]
test.dt
   geneSymbol     peaks
1:   AK056486 Pol2_only
2:      GAPDH Pol2_only
3:     WASH7P Pol2_only

The issue seems to be that the subsetting works row by row, whereas I need it to be applied to each geneSymbol subgroup as a whole.

Could you please help me subset on the group? A data.table solution would be welcome because it is faster, but plyr (or even base R) is fine. A solution that adds an extra column noting the nature of the peak would be perfect. This is what I mean:

tdf
         geneSymbol     peaks      newCol
16         AK056486 Pol2_only   Pol2_only
13         AK310751   no_peak     no_peak
7          BC036251   no_peak     no_peak
10         DQ575786   no_peak     no_peak
4          DQ597235   no_peak     no_peak
5          DQ599768   no_peak     no_peak
11         DQ599872   no_peak     no_peak
12         DQ599872   no_peak     no_peak
2           FAM138F   no_peak     no_peak
15           FAM41C   no_peak     no_peak
34116         GAPDH      both        both
283034        GAPDH Pol2_only        both
6      LOC100132062   no_peak     no_peak
9      LOC100133331   no_peak     no_peak
14     LOC100288069      both        both
8            M37726   no_peak     no_peak
3             OR4F5   no_peak     no_peak
17           SAMD11      both        both
18           SAMD11      both        both
19           SAMD11      both        both
20           SAMD11      both        both
21           SAMD11      both        both
22           SAMD11      both        both
23           SAMD11      both        both
24           SAMD11      both        both
25           SAMD11      both        both
1            WASH7P Pol2_only   Pol2_only

Notice again that GAPDH is now "both" in both of its rows. Here is the data:

dput(tdf)
structure(list(geneSymbol = c("AK056486", "AK310751", "BC036251", 
"DQ575786", "DQ597235", "DQ599768", "DQ599872", "DQ599872", "FAM138F", 
"FAM41C", "GAPDH", "GAPDH", "LOC100132062", "LOC100133331", "LOC100288069", 
"M37726", "OR4F5", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "SAMD11", 
"SAMD11", "SAMD11", "SAMD11", "SAMD11", "WASH7P"), peaks = c("Pol2_only", 
"no_peak", "no_peak", "no_peak", "no_peak", "no_peak", "no_peak", 
"no_peak", "no_peak", "no_peak", "both", "Pol2_only", "no_peak", 
"no_peak", "both", "no_peak", "no_peak", "both", "both", "both", 
"both", "both", "both", "both", "both", "both", "Pol2_only")), .Names = c("geneSymbol", 
"peaks"), row.names = c(16L, 13L, 7L, 10L, 4L, 5L, 11L, 12L, 
2L, 15L, 34116L, 283034L, 6L, 9L, 14L, 8L, 3L, 17L, 18L, 19L, 
20L, 21L, 22L, 23L, 24L, 25L, 1L), class = "data.frame")

Thank you!

Edit: I've found a workaround for the problem. The selection was being done row by row. All that is needed is a hack: require that ALL values in the logical vector returned for a group are TRUE. So here is what I did with the plyr function:

ddply(tdf, .(geneSymbol), function(dfrm) subset(dfrm, all(peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only")), .parallel=TRUE)
  geneSymbol     peaks
1   AK056486 Pol2_only
2     WASH7P Pol2_only

Note the use of all() around the conditions. Now the result is as expected, that is, genes that are "Pol2_only" only (redundancy alert) :) What is still left to be done is the implementation in data.table, which I tried but failed to do. Any help?

I have not written an answer to my own question, in the hope that someone comes along with a better solution in data.table.

As you requested a data.table solution:

# set the key to geneSymbol, then peaks
TDF <- data.table(tdf, key = c('geneSymbol','peaks'))

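# use unique to get unique combinations, then for each geneSymbol get the first
# match (this relies on the key order both < Pol2_only < no_peak within each
# geneSymbol; data.table sorts keys in C-locale, where uppercase letters come
# before lowercase ones, so check the order on your version)
# then exclude those with peaks == "no_peak"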

unique(TDF)[.(unique(geneSymbol)), mult = 'first'][!peaks =='no_peak']

#      geneSymbol     peaks
# 1:     AK056486 Pol2_only
# 2:        GAPDH      both
# 3: LOC100288069      both
# 4:       SAMD11      both
# 5:       WASH7P Pol2_only
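
A grouped variant that does not depend on how the key is sorted can be sketched as follows. This is only an illustration: the small `tdf` below is a hypothetical stand-in for the question's data, and `label_gene` is a helper name invented for this sketch, not part of data.table.

```r
library(data.table)

# Hypothetical stand-in for the question's tdf (a few representative genes).
tdf <- data.frame(
  geneSymbol = c("AK056486", "GAPDH", "GAPDH", "OR4F5", "SAMD11", "SAMD11", "WASH7P"),
  peaks      = c("Pol2_only", "both", "Pol2_only", "no_peak", "both", "both", "Pol2_only"),
  stringsAsFactors = FALSE
)
DT <- as.data.table(tdf)

# Condition 1: keep a gene if any of its rows is "both" or "Pol2_only".
pol2_any <- DT[, if (any(peaks %in% c("both", "Pol2_only"))) .SD, by = geneSymbol]

# Condition 2: keep a gene only if every one of its rows is "Pol2_only"
# (the same all() idea as the plyr workaround in the question).
pol2_only <- DT[, if (all(peaks == "Pol2_only")) .SD, by = geneSymbol]

# Hypothetical helper: label a gene by its highest-priority peak type.
label_gene <- function(p) {
  if ("both" %in% p) {
    "both"
  } else if ("Pol2_only" %in% p) {
    "Pol2_only"
  } else {
    "no_peak"
  }
}

# Add the label to every row of the gene, by reference.
DT[, newCol := label_gene(peaks), by = geneSymbol]
```

Because the `if` inside `j` returns `NULL` for groups that fail the test, those groups are simply dropped from the result, so the condition is evaluated once per geneSymbol rather than once per row.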

You don't need plyr for this.

a <- tdf$geneSymbol[tdf$peaks %in% c("both", "Pol2_only")]
b <- tdf$geneSymbol[tdf$peaks != "Pol2_only"]
result <- setdiff(a, b)

And to make a new column in your data frame:

tdf$newcol <- with(tdf, ifelse(geneSymbol %in% result, "Pol2_only",
                        ifelse(geneSymbol %in% a, "both", "no_peak")))
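
Another base R way to build the column, sketched here under the assumption that the labels should be ranked "both" > "Pol2_only" > "no_peak", is to turn each label into a rank and take the best rank per gene with `ave()`. The small `tdf` and the names `priority`, `rk` and `best` are hypothetical and exist only for this illustration.

```r
# Hypothetical stand-in for the question's tdf (a few representative genes).
tdf <- data.frame(
  geneSymbol = c("AK056486", "GAPDH", "GAPDH", "OR4F5", "SAMD11", "SAMD11", "WASH7P"),
  peaks      = c("Pol2_only", "both", "Pol2_only", "no_peak", "both", "both", "Pol2_only"),
  stringsAsFactors = FALSE
)

# Assumed priority order: earlier entries win within a gene.
priority <- c("both", "Pol2_only", "no_peak")

# Rank of each row's label (1 = "both", 2 = "Pol2_only", 3 = "no_peak").
rk <- match(tdf$peaks, priority)

# Smallest (best) rank within each geneSymbol, repeated on every row of that gene.
best <- ave(rk, tdf$geneSymbol, FUN = min)

# Map the best rank back to its label.
tdf$newCol <- priority[best]
```

With this ranking, a gene that has both a "Pol2_only" row and a "no_peak" row is labelled "Pol2_only"; whether that is the desired behaviour depends on how such mixed genes should be treated, which the question does not specify.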
