简体   繁体   English

查找和删除3列中相同且不同的行1

[英]Find and remove rows that are identical in 3 columns and differ in 1

I have binned my data in intervals (of 100000) using 2 different frames: from 0 to 100000 and onwards, and from 50000 to 150000 and onwards. 我使用2个不同的帧(从0到100000及以后,从50000到150000及以后)以间隔(100000)对数据进行分箱。 I then joined both dataframes, using one column as identifier for the frames (represented in column "x100kb"). 然后我加入了两个数据帧,使用一列作为帧的标识符(在“x100kb”列中表示)。

For my purpose, if 2 rows (edit: they don't need to be sequent to each other; since the data is not ordered by "chr" and "x100kb" right now) differ in "x100kb" by 0.5 (preferably comparing whole numbers to their +0.5; eg: 60 to 60.5, 65 to 65.5; etc) but they have the same values in "chr" and "occurrences_norm" and "occurrences_tum"; 为了我的目的,如果2行(编辑:它们不需要彼此连续;因为数据不是由“chr”和“x100kb”现在订购)在“x100kb”中相差0.5(最好比较整数数字为+0.5;例如:60至60.5,65至65.5;等等)但它们在“chr”和“occurrences_norm”和“occurrences_tum”中具有相同的值; then they are equal and I want to remove one of them. 然后他们是平等的,我想删除其中一个。 The only thing coming to mind now are loops, which obviusly is not very productive... 现在唯一想到的就是循环,这显然不是很有效率......

Data example: 数据示例:

       chr    x100Kb occurrences_norm    occurrences_tum   fold
19064 chr17   61.5               17               0 14.05333
38799  chr5  526.0               16               0 13.96587
38800  chr5  526.5               16               0 13.96587
39946  chr5 1113.5               16               0 13.96587
2377   chr1 1426.0               15               0 13.87277
21859 chr18  733.5               15               0 13.87277
20538 chr18   24.0               14               0 13.77324
21863 chr18  735.5               14               0 13.77324
37699  chr4 1835.5               14               0 13.77324
39924  chr5 1102.5               14               0 13.77324
21506 chr18  550.5               13               0 13.66633
21862 chr18  735.0               13               0 13.66633
22258 chr19  151.5               13               0 13.66633
38972  chr5  613.0               13               0 13.66633
41707  chr6  194.5               13               0 13.66633
2380   chr1 1427.5               12               0 13.55087
20541 chr18   25.5               12               0 13.55087
21252 chr18  421.0               12               0 13.55087
27384  chr2 2243.0               12               0 13.55087
39990  chr5 1135.5               12               0 13.55087

In the example, the 3rd row would be removed. 在该示例中,将删除第3行。

I read the question in a different way. 我以不同的方式阅读了这个问题。 I thought we need to compare any two sequent rows. 我认为我们需要比较任何两个后续行。 For example, check row 1 & 2, row 2 & 3, and so on. 例如,检查第1行和第2行,第2行和第3行,依此类推。 I also thought that the condition is the difference in x100Kb is 0.5, not large than 0.5. 我还认为条件是x100Kb的差异是0.5,不大于0.5。 I thought running four logical checks, using shift() , would be one way to achieve the goal. 我认为运行四个逻辑检查,使用shift() ,将是实现目标的一种方法。

setDT(df1)[!((abs(x100Kb - shift(x100Kb, type = "lag", fill = -Inf)) == 0.5) &
             (chr == shift(chr, type = "lag")) &
             (occurrences_norm == shift(occurrences_norm, type = "lag")) &
             (occurrences_tum == shift(occurrences_tum, type = "lag")))
           ]

#      chr x100Kb occurrences_norm occurrences_tum     fold
# 1: chr17   61.5               17               0 14.05333
# 2:  chr5  526.0               16               0 13.96587
# 3:  chr5 1113.5               16               0 13.96587
# 4:  chr1 1426.0               15               0 13.87277
# 5: chr18  733.5               15               0 13.87277
# 6: chr18   24.0               14               0 13.77324
# 7: chr18  735.5               14               0 13.77324
# 8:  chr4 1835.5               14               0 13.77324
# 9:  chr5 1102.5               14               0 13.77324
#10: chr18  550.5               13               0 13.66633
#11: chr18  735.0               13               0 13.66633
#12: chr19  151.5               13               0 13.66633
#13:  chr5  613.0               13               0 13.66633
#14:  chr6  194.5               13               0 13.66633
#15:  chr1 1427.5               12               0 13.55087
#16: chr18   25.5               12               0 13.55087
#17: chr18  421.0               12               0 13.55087
#18:  chr2 2243.0               12               0 13.55087
#19:  chr5 1135.5               12               0 13.55087

We could also the data.table 我们也可以data.table

library(data.table)
setDT(df1)[df1[,  .I[abs(x100Kb - shift(x100Kb, fill = -Inf)) > 0.5]  , 
                  by =  .(chr, occurrences_norm, occurrences_tum)]$V1]
#      chr x100Kb occurrences_norm occurrences_tum     fold
# 1: chr17   61.5               17               0 14.05333
# 2:  chr5  526.0               16               0 13.96587
# 3:  chr5 1113.5               16               0 13.96587
# 4:  chr1 1426.0               15               0 13.87277
# 5: chr18  733.5               15               0 13.87277
# 6: chr18   24.0               14               0 13.77324
# 7: chr18  735.5               14               0 13.77324
# 8:  chr4 1835.5               14               0 13.77324
# 9:  chr5 1102.5               14               0 13.77324
#10: chr18  550.5               13               0 13.66633
#11: chr18  735.0               13               0 13.66633
#12: chr19  151.5               13               0 13.66633
#13:  chr5  613.0               13               0 13.66633
#14:  chr6  194.5               13               0 13.66633
#15:  chr1 1427.5               12               0 13.55087
#16: chr18   25.5               12               0 13.55087
#17: chr18  421.0               12               0 13.55087
#18:  chr2 2243.0               12               0 13.55087
#19:  chr5 1135.5               12               0 13.55087

Try this using dplyr package 使用dplyr包试试这个

library(dplyr)
df1 %>% group_by(chr,occurrences_norm,occurrences_tum) %>% 
mutate(tmp=diff(c(0,x100Kb))) %>% filter(tmp>0.5) %>% select(-tmp)

# chr x100Kb occurrences_norm occurrences_tum     fold
# (fctr)  (dbl)            (int)           (int)    (dbl)
# 1   chr17   61.5               17               0 14.05333
# 2    chr5  526.0               16               0 13.96587
# 3    chr5 1113.5               16               0 13.96587
# 4    chr1 1426.0               15               0 13.87277
# 5   chr18  733.5               15               0 13.87277
# 6   chr18   24.0               14               0 13.77324
# 7   chr18  735.5               14               0 13.77324
# 8    chr4 1835.5               14               0 13.77324
# 9    chr5 1102.5               14               0 13.77324
# 10  chr18  550.5               13               0 13.66633
# 11  chr18  735.0               13               0 13.66633
# 12  chr19  151.5               13               0 13.66633
# 13   chr5  613.0               13               0 13.66633
# 14   chr6  194.5               13               0 13.66633
# 15   chr1 1427.5               12               0 13.55087
# 16  chr18   25.5               12               0 13.55087
# 17  chr18  421.0               12               0 13.55087
# 18   chr2 2243.0               12               0 13.55087
# 19   chr5 1135.5               12               0 13.55087

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM