Find and remove rows that are identical in 3 columns and differ in 1
I have binned my data in intervals of 100000 using 2 different frames: from 0 to 100000 and onwards, and from 50000 to 150000 and onwards. I then joined both data frames, using one column as an identifier for the frames (column "x100Kb").
For my purpose, if 2 rows differ in "x100Kb" by 0.5 (preferably comparing whole numbers to their +0.5 counterpart, e.g. 60 to 60.5, 65 to 65.5, etc.) but have the same values in "chr", "occurrences_norm" and "occurrences_tum", then they are equal and I want to remove one of them. (Edit: the rows don't need to be adjacent, since the data is not currently ordered by "chr" and "x100Kb".) The only thing that comes to mind right now is loops, which obviously is not very efficient...
Data example:
chr x100Kb occurrences_norm occurrences_tum fold
19064 chr17 61.5 17 0 14.05333
38799 chr5 526.0 16 0 13.96587
38800 chr5 526.5 16 0 13.96587
39946 chr5 1113.5 16 0 13.96587
2377 chr1 1426.0 15 0 13.87277
21859 chr18 733.5 15 0 13.87277
20538 chr18 24.0 14 0 13.77324
21863 chr18 735.5 14 0 13.77324
37699 chr4 1835.5 14 0 13.77324
39924 chr5 1102.5 14 0 13.77324
21506 chr18 550.5 13 0 13.66633
21862 chr18 735.0 13 0 13.66633
22258 chr19 151.5 13 0 13.66633
38972 chr5 613.0 13 0 13.66633
41707 chr6 194.5 13 0 13.66633
2380 chr1 1427.5 12 0 13.55087
20541 chr18 25.5 12 0 13.55087
21252 chr18 421.0 12 0 13.55087
27384 chr2 2243.0 12 0 13.55087
39990 chr5 1135.5 12 0 13.55087
In the example, the 3rd row would be removed.
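The answers below operate on a data frame called df1, which the question does not construct. A minimal sketch for rebuilding it from the sample above, assuming the leading number on each data line is the original row name:

```r
# Sample rows copied from the question. The header has one field fewer than
# the data lines, so read.table() treats the first field as row names.
df1 <- read.table(header = TRUE, text = "
chr x100Kb occurrences_norm occurrences_tum fold
19064 chr17 61.5 17 0 14.05333
38799 chr5 526.0 16 0 13.96587
38800 chr5 526.5 16 0 13.96587
39946 chr5 1113.5 16 0 13.96587
2377 chr1 1426.0 15 0 13.87277
21859 chr18 733.5 15 0 13.87277
20538 chr18 24.0 14 0 13.77324
21863 chr18 735.5 14 0 13.77324
37699 chr4 1835.5 14 0 13.77324
39924 chr5 1102.5 14 0 13.77324
21506 chr18 550.5 13 0 13.66633
21862 chr18 735.0 13 0 13.66633
22258 chr19 151.5 13 0 13.66633
38972 chr5 613.0 13 0 13.66633
41707 chr6 194.5 13 0 13.66633
2380 chr1 1427.5 12 0 13.55087
20541 chr18 25.5 12 0 13.55087
21252 chr18 421.0 12 0 13.55087
27384 chr2 2243.0 12 0 13.55087
39990 chr5 1135.5 12 0 13.55087
")

# Inspect the column types before trying any of the approaches
str(df1)
```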
I read the question in a different way. I thought we need to compare each pair of consecutive rows: for example, rows 1 & 2, rows 2 & 3, and so on. I also took the condition to be that the difference in x100Kb is exactly 0.5, not larger than 0.5. Running four logical checks with shift() would be one way to achieve that:
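library(data.table)
# Drop a row when it matches the previous row on chr and both occurrence
# columns and its x100Kb differs from the previous row's by exactly 0.5.
setDT(df1)[!((abs(x100Kb - shift(x100Kb, type = "lag", fill = -Inf)) == 0.5) &
             (chr == shift(chr, type = "lag")) &
             (occurrences_norm == shift(occurrences_norm, type = "lag")) &
             (occurrences_tum == shift(occurrences_tum, type = "lag")))]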
# chr x100Kb occurrences_norm occurrences_tum fold
# 1: chr17 61.5 17 0 14.05333
# 2: chr5 526.0 16 0 13.96587
# 3: chr5 1113.5 16 0 13.96587
# 4: chr1 1426.0 15 0 13.87277
# 5: chr18 733.5 15 0 13.87277
# 6: chr18 24.0 14 0 13.77324
# 7: chr18 735.5 14 0 13.77324
# 8: chr4 1835.5 14 0 13.77324
# 9: chr5 1102.5 14 0 13.77324
#10: chr18 550.5 13 0 13.66633
#11: chr18 735.0 13 0 13.66633
#12: chr19 151.5 13 0 13.66633
#13: chr5 613.0 13 0 13.66633
#14: chr6 194.5 13 0 13.66633
#15: chr1 1427.5 12 0 13.55087
#16: chr18 25.5 12 0 13.55087
#17: chr18 421.0 12 0 13.55087
#18: chr2 2243.0 12 0 13.55087
#19: chr5 1135.5 12 0 13.55087
We could also use data.table:
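library(data.table)
# Within each (chr, occurrences_norm, occurrences_tum) group, keep the rows
# whose x100Kb is more than 0.5 away from the previous row in that group.
setDT(df1)[df1[, .I[abs(x100Kb - shift(x100Kb, fill = -Inf)) > 0.5],
               by = .(chr, occurrences_norm, occurrences_tum)]$V1]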
# chr x100Kb occurrences_norm occurrences_tum fold
# 1: chr17 61.5 17 0 14.05333
# 2: chr5 526.0 16 0 13.96587
# 3: chr5 1113.5 16 0 13.96587
# 4: chr1 1426.0 15 0 13.87277
# 5: chr18 733.5 15 0 13.87277
# 6: chr18 24.0 14 0 13.77324
# 7: chr18 735.5 14 0 13.77324
# 8: chr4 1835.5 14 0 13.77324
# 9: chr5 1102.5 14 0 13.77324
#10: chr18 550.5 13 0 13.66633
#11: chr18 735.0 13 0 13.66633
#12: chr19 151.5 13 0 13.66633
#13: chr5 613.0 13 0 13.66633
#14: chr6 194.5 13 0 13.66633
#15: chr1 1427.5 12 0 13.55087
#16: chr18 25.5 12 0 13.55087
#17: chr18 421.0 12 0 13.55087
#18: chr2 2243.0 12 0 13.55087
#19: chr5 1135.5 12 0 13.55087
Try this using the dplyr package:
library(dplyr)
df1 %>%
  group_by(chr, occurrences_norm, occurrences_tum) %>%
  # lag() with default = -Inf always keeps the first row of a group, even at x100Kb 0 or 0.5
  filter(abs(x100Kb - lag(x100Kb, default = -Inf)) > 0.5)
# chr x100Kb occurrences_norm occurrences_tum fold
# (fctr) (dbl) (int) (int) (dbl)
# 1 chr17 61.5 17 0 14.05333
# 2 chr5 526.0 16 0 13.96587
# 3 chr5 1113.5 16 0 13.96587
# 4 chr1 1426.0 15 0 13.87277
# 5 chr18 733.5 15 0 13.87277
# 6 chr18 24.0 14 0 13.77324
# 7 chr18 735.5 14 0 13.77324
# 8 chr4 1835.5 14 0 13.77324
# 9 chr5 1102.5 14 0 13.77324
# 10 chr18 550.5 13 0 13.66633
# 11 chr18 735.0 13 0 13.66633
# 12 chr19 151.5 13 0 13.66633
# 13 chr5 613.0 13 0 13.66633
# 14 chr6 194.5 13 0 13.66633
# 15 chr1 1427.5 12 0 13.55087
# 16 chr18 25.5 12 0 13.55087
# 17 chr18 421.0 12 0 13.55087
# 18 chr2 2243.0 12 0 13.55087
# 19 chr5 1135.5 12 0 13.55087
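The question's edit notes that matching rows need not be adjacent, while the approaches above compare each row with the one before it. A rough order-independent sketch in base R, assuming the goal is to drop the +0.5 row whenever its whole-number partner exists. The helper name drop_half_step_duplicates and the small demo data frame (a reshuffled subset of the sample rows) are made up for illustration:

```r
# Hypothetical helper, not taken from any of the answers above.
drop_half_step_duplicates <- function(df) {
  # Build one string key per row from the three matching columns plus a bin value
  make_key <- function(bin) {
    paste(df$chr, bin, df$occurrences_norm, df$occurrences_tum, sep = "|")
  }
  # Rows from the shifted frame have a fractional part of exactly 0.5
  is_half <- df$x100Kb %% 1 == 0.5
  # Keys of the rows sitting on whole-number bins
  whole_keys <- make_key(df$x100Kb)[!is_half]
  # Does a whole-number row with the same chr and counts exist 0.5 below?
  has_partner <- make_key(df$x100Kb - 0.5) %in% whole_keys
  df[!(is_half & has_partner), , drop = FALSE]
}

# Subset of the sample rows, deliberately reordered so that the
# 526.0 / 526.5 pair is not adjacent.
demo <- data.frame(
  chr = c("chr5", "chr18", "chr5", "chr18", "chr5"),
  x100Kb = c(526.5, 735.5, 1113.5, 735.0, 526.0),
  occurrences_norm = c(16, 14, 16, 13, 16),
  occurrences_tum = c(0, 0, 0, 0, 0)
)

result <- drop_half_step_duplicates(demo)
print(result)
```

Here only the chr5 526.5 row is dropped; the chr18 735.5 row stays because its occurrences_norm differs from the 735.0 row. Matching on string keys avoids sorting, at the cost of building two character vectors the length of the data.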