I have two genetic datasets and want to filter file1 based on two columns in file2. A row of file1 should be kept only if its chromosome position is more than 5000 above or more than 5000 below every position of the variants on the same chromosome in file2. The two columns used in the condition are therefore chromosome position (Position) and chromosome (Chrom). For example, my data looks like:
File 1:
Variant Chrom Position
Variant1 2 14000
Variant2 1 9000
Variant3 8 37000
Variant4 1 21000
File 2:
Variant Chrom Position
Variant1 1 10000
Variant2 1 20000
Variant3 8 30000
Expected output (variants more than 5000 away, in either direction, from every line of file2 on the same chromosome):
Variant Chrom Position
Variant1 2 14000
Variant3 8 37000
#Variant1 at 14000 is within 5000 of Variant1 at 10000 in file2, but that variant is on a different chromosome, so the two are not compared and Variant1 is kept.
#Variant3 is on the same chromosome as Variant3 in file2 but more than 5000 away from it (37000 vs 30000), so it is kept.
#Every other file1 variant is within 5000 of a variant on the same chromosome in file2, so it is not kept.
My code for this is based on an answer given to my previous question ( How to select lines of file based on multiple conditions of another file in R? ). However, with the example data above, the code finds only one of the two expected variants. I have also tried a proof-of-concept test to confirm that the code is running correctly, and that test looks wrong as well.
Here is the code:
library(data.table)
dt1<-fread("file1.txt")
dt2<-fread("file2.txt")
dt2[, c("low", "high") := .(position - 5000, position + 5000)]
#find matches on chromosome, with position between low&high
dt1[ dt2, match := i.Variant,
on = .(chrom, position > low, position < high ) ]
#discard all found matches (match != NA ), and then drop the match-column
df <- dt1[ is.na(match) ][, match := NULL ][]
fwrite(df, "file3.csv")
This currently outputs only:
Variant Chrom Position
1: Variant1 2 14000
To check this code further, I tried to get the opposite set of data for comparison by inverting the > and < in this code:
dt1[ dt2, match := i.Variant,
on = .(Chrom, Position > low, Position < high ) ]
test1 <- dt1[ is.na(match) ][, match := NULL ][]
dt1[ dt2, match := i.Variant,
on = .(Chrom, Position < low, Position > high ) ]
test2 <- dt1[ is.na(match) ][, match := NULL ][]
Both test1 and test2 output identical matches and mismatches when I check with both my actual dataset and the example here. Is there a reason why this would happen that I am missing?
If you change your code to the following, it will provide the following results:
Variant Chrom Position
1: Variant1 2 14000
2: Variant3 8 37000
Code:
library(data.table)
dt1 <- fread("file1.txt")
dt2 <- fread("file2.txt")
dt2[, c("low", "high") := .(Position - 5000, Position + 5000)]
dt1[ dt2, match := i.Variant, on = .(Chrom, Position > low, Position < high)]
df <- dt1[ is.na(match) ][, match := NULL ][]
fwrite(df, "file3.csv")
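For a quick check without the input files, the same steps can be run on the example data typed in directly. This is only a sketch under assumptions: the tables are built with data.table() rather than read with fread(), the columns are plain numerics, and kept is a name made up here for the result. It should give the same two rows as shown above.

```r
library(data.table)

# Assumption: example data from the question, typed in by hand instead of fread()
dt1 <- data.table(
  Variant  = c("Variant1", "Variant2", "Variant3", "Variant4"),
  Chrom    = c(2, 1, 8, 1),
  Position = c(14000, 9000, 37000, 21000)
)
dt2 <- data.table(
  Variant  = c("Variant1", "Variant2", "Variant3"),
  Chrom    = c(1, 1, 8),
  Position = c(10000, 20000, 30000)
)

# Window of +/- 5000 around every file2 position
dt2[, c("low", "high") := .(Position - 5000, Position + 5000)]

# Flag file1 rows that fall inside a window on the same chromosome
dt1[dt2, match := i.Variant, on = .(Chrom, Position > low, Position < high)]

# Keep only the rows that were never flagged ("kept" is a hypothetical name)
kept <- dt1[is.na(match)][, match := NULL][]
print(kept)
```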
The main reason is that your position and chrom references did not point to existing columns: the columns in your files are named Position and Chrom, and R is case-sensitive.
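As for test1 and test2 being identical, my reading (not checked against your real data) is that two things combine. First, Position < low and Position > high can never both be true, because low is always smaller than high, so the second join matches no rows. Second, dt1[ is.na(match) ] returns a new table, so match := NULL removes the column from that copy only and dt1 still carries the match column from the first join. The second join then updates nothing and test2 repeats test1. A sketch of a more direct check is to take both sets from the same join, so that together they must add up to dt1; inside and outside are names made up here.

```r
library(data.table)

# Assumption: example data from the question, typed in by hand instead of fread()
dt1 <- data.table(
  Variant  = c("Variant1", "Variant2", "Variant3", "Variant4"),
  Chrom    = c(2, 1, 8, 1),
  Position = c(14000, 9000, 37000, 21000)
)
dt2 <- data.table(
  Variant  = c("Variant1", "Variant2", "Variant3"),
  Chrom    = c(1, 1, 8),
  Position = c(10000, 20000, 30000)
)
dt2[, c("low", "high") := .(Position - 5000, Position + 5000)]

# One join only: flag file1 rows inside a file2 window on the same chromosome
dt1[dt2, match := i.Variant, on = .(Chrom, Position > low, Position < high)]

# Split on the flag; the two sets are complements by construction
outside <- dt1[is.na(match)]    # nothing in file2 within 5000 on that chromosome
inside  <- dt1[!is.na(match)]   # at least one file2 variant within 5000

# Remove the helper column from dt1 itself before running any further join
dt1[, match := NULL]

print(outside)
print(inside)
```

If you do want to run a second join on dt1, dropping the match column from dt1 itself first (as in the last step) keeps the earlier result from leaking into the next test.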
WrehBah