
How to extract from a file based on conditions of another file in R

I have two genetic datasets and want to filter file1 based on two columns in file2: chromosome (Chrom) and chromosome position (Position). A row of file1 should be kept only if its position is more than 5000 away, in either direction, from every variant on the same chromosome in file2. For example, my data looks like:

File 1:

Variant      Chrom         Position  
Variant1      2            14000     
Variant2      1            9000              
Variant3      8            37000          
Variant4      1            21000    

File 2:

Variant      Chrom         Position  
Variant1      1            10000                   
Variant2      1            20000                   
Variant3      8            30000      

Expected output (file1 variants that are more than 5000 away from every file2 variant on the same chromosome):

Variant      Chrom         Position
Variant1      2            14000
Variant3      8            37000

#Variant1 at 14000 is within 5000 of Variant1 at 10000 in file2, but that variant is on a different chromosome, so the two are not compared and Variant1 is kept.
#Variant3 is on the same chromosome as Variant3 in file2 but is more than 5000 away from it (37000 vs 30000), so it is kept.
#Every other file1 variant is within 5000 of a variant on the same chromosome in file2, so it is not kept.

My code is based on an answer given to my previous question ( How to select lines of file based on multiple conditions of another file in R? ). However, with the example data above it finds only one of the two expected variants. I have also tried a proof-of-concept test to check that the code runs correctly, and that result looks wrong as well.

Here is the code:

library(data.table)
dt1<-fread("file1.txt")  
dt2<-fread("file2.txt")   

dt2[, c("low", "high") := .(position - 5000, position  + 5000)]

#find matches on chromosome, with position between low&high
dt1[ dt2, match := i.Variant,
     on = .(chrom, position > low, position < high ) ]

#discard all found matches (match != NA ), and then drop the match-column
df <- dt1[ is.na(match) ][, match := NULL ][]   
fwrite(df, "file3.csv") 

This currently outputs only:

    Variant Chrom Position
1: Variant1     2    14000

To check this code further, I tried to get the opposite set of rows for comparison by reversing the > and < in the join:

dt1[ dt2, match := i.Variant,
     on = .(Chrom, Position > low, Position < high ) ]
test1 <- dt1[ is.na(match) ][, match := NULL ][]

dt1[ dt2, match := i.Variant,
     on = .(Chrom, Position < low, Position > high ) ]
test2 <-  dt1[ is.na(match) ][, match := NULL ][]

Both test1 and test2 output identical matches and mismatches when I check with both my actual dataset and the example here. Is there a reason why this would happen that I am missing?

If you change your code to the version below, it produces the expected result:

    Variant Chrom Position
1: Variant1     2    14000
2: Variant3     8    37000

Code:

library(data.table)
dt1 <- fread("file1.txt")
dt2 <- fread("file2.txt")

dt2[, c("low", "high") := .(Position - 5000, Position + 5000)]
dt1[ dt2, match := i.Variant, on = .(Chrom, Position > low, Position < high)]
df <- dt1[ is.na(match) ][, match := NULL ][]

fwrite(df, "file3.csv")
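
As a cross-check on the join above, the same rule can be computed by brute force in base R. This is only a sketch: the example tables are typed in by hand from the question, and is_isolated is a hypothetical helper name, not part of data.table. It uses the same strict comparisons as the join (Position > low, Position < high).

```r
# Example tables from the question, built inline so no input files are needed.
file1 <- data.frame(
  Variant  = c("Variant1", "Variant2", "Variant3", "Variant4"),
  Chrom    = c(2, 1, 8, 1),
  Position = c(14000, 9000, 37000, 21000)
)
file2 <- data.frame(
  Variant  = c("Variant1", "Variant2", "Variant3"),
  Chrom    = c(1, 1, 8),
  Position = c(10000, 20000, 30000)
)

# Hypothetical helper: TRUE when no variant in `ref` on the same chromosome
# lies strictly within `window` of the given position.
is_isolated <- function(chrom, pos, ref, window = 5000) {
  same_chrom <- ref$Position[ref$Chrom == chrom]
  !any(abs(same_chrom - pos) < window)
}

# Apply the helper to every row of file1 and keep the isolated rows.
keep <- mapply(is_isolated, file1$Chrom, file1$Position,
               MoreArgs = list(ref = file2))
kept <- file1[keep, ]
print(kept)
```

On the example data this keeps Variant1 and Variant3, the same rows as the join. Note that with strict comparisons a variant exactly 5000 away is kept; if such a variant should be discarded, changing < to <= in the helper would do that.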

The main reason is that your position and chrom references did not point to existing columns: the columns in your data are named Position and Chrom, and column names in R are case-sensitive.
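
On the question of why test1 and test2 come out identical, one likely explanation is that := adds the match column to dt1 by reference, so it is still there when the second join runs. The reversed conditions (Position < low and Position > high) can never both hold, so the second join updates no rows and the old match values are filtered again. The sketch below, with the example data typed in by hand, illustrates this; close_by is a name introduced here for illustration.

```r
library(data.table)

# Example tables from the question, built inline instead of read with fread().
dt1 <- data.table(
  Variant  = c("Variant1", "Variant2", "Variant3", "Variant4"),
  Chrom    = c(2, 1, 8, 1),
  Position = c(14000, 9000, 37000, 21000)
)
dt2 <- data.table(
  Variant  = c("Variant1", "Variant2", "Variant3"),
  Chrom    = c(1, 1, 8),
  Position = c(10000, 20000, 30000)
)
dt2[, c("low", "high") := .(Position - 5000, Position + 5000)]

# First join: adds `match` to dt1 by reference, filled for rows inside a window.
dt1[dt2, match := i.Variant, on = .(Chrom, Position > low, Position < high)]
test1 <- dt1[is.na(match)][, match := NULL][]
match_before <- dt1$match

# Reversed operators: nothing can be below `low` and above `high` at once,
# so no rows are updated and dt1 still carries `match` from the first join.
dt1[dt2, match := i.Variant, on = .(Chrom, Position < low, Position > high)]
test2 <- dt1[is.na(match)][, match := NULL][]

identical(match_before, dt1$match)       # the column was left unchanged
identical(test1$Variant, test2$Variant)  # hence the identical results

# The complementary set is obtained by negating the filter, not the operators.
close_by <- dt1[!is.na(match)][, match := NULL][]
print(close_by)
```

Here close_by holds Variant2 and Variant4, the rows the original filter discards. When running several joins against the same table, it may be safer to work on copy(dt1) or to drop the match column before each join.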

WrehBah
