
How to extract from a file based on conditions of another file in R

I have 2 genetic datasets. I want to filter file1 based on two columns in file2. A row of file1 should be extracted only if its chromosome position is more than 5000 above or more than 5000 below every position of the variants on the same chromosome in file2. So the columns used as conditions are chromosome position (Position) and chromosome (Chrom). For example, my data looks like:

File 1:

Variant      Chrom         Position  
Variant1      2            14000     
Variant2      1            9000              
Variant3      8            37000          
Variant4      1            21000    

File 2:

Variant      Chrom         Position  
Variant1      1            10000                   
Variant2      1            20000                   
Variant3      8            30000      

Expected output (variants whose position is more than 5000 away from every line of file 2 on the same chromosome):

Variant     Chromosome        Position
Variant1       2               14000
Variant3       8               37000

#Variant1 at 14000 is within 5000 of Variant1 at 10000 in file2, but that variant is on a different chromosome, so it is not compared and Variant1 is kept.
#Variant3 is on the same chromosome as Variant3 in file2 but is more than 5000 away from it, so it is kept.
#Every other file1 variant is within 5000 of a variant on the same chromosome in file2, so it is not kept.

I coded this based on an answer given to my previous question (How to select lines of file based on multiple conditions of another file in R?). However, with the example data provided, the output of this code contains only 1 of the 2 expected variants. I am also trying to run a proof-of-concept test that this code is working correctly, and that test seems incorrect as well.

Here is the code:

library(data.table)
dt1<-fread("file1.txt")  
dt2<-fread("file2.txt")   

dt2[, c("low", "high") := .(position - 5000, position  + 5000)]

#find matches on chromosome, with position between low&high
dt1[ dt2, match := i.Variant,
     on = .(chrom, position > low, position < high ) ]

#discard all found matches (match != NA ), and then drop the match-column
df <- dt1[ is.na(match) ][, match := NULL ][]   
fwrite(df, "file3.csv") 

This currently outputs only:

    Variant Chrom Position
1: Variant1     2    14000

To check this code further, I have tried to get the opposite set of data for comparison by reversing the > and < in this code:

dt1[ dt2, match := i.Variant,
     on = .(Chrom, Position > low, Position < high ) ]
test1 <- dt1[ is.na(match) ][, match := NULL ][]

dt1[ dt2, match := i.Variant,
     on = .(Chrom, Position < low, Position > high ) ]
test2 <-  dt1[ is.na(match) ][, match := NULL ][]

Both test1 and test2 output identical matches and mismatches when I check with both my actual dataset and the example here. Is there a reason for this that I am missing?

If you change your code to the version below, it will produce the following results:

    Variant Chrom Position
1: Variant1     2    14000
2: Variant3     8    37000

Code:

library(data.table)
dt1 <- fread("file1.txt")
dt2 <- fread("file2.txt")

dt2[, c("low", "high") := .(Position - 5000, Position + 5000)]
dt1[ dt2, match := i.Variant, on = .(Chrom, Position > low, Position < high)]
df <- dt1[ is.na(match) ][, match := NULL ][]

fwrite(df, "file3.csv")
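
To try the corrected code without the two input files, here is a minimal sketch that builds the example tables inline. The data.table() calls stand in for fread() and the result name kept is only illustrative; both are assumptions, not part of the original code.

```r
library(data.table)

# Example data from the question, built inline instead of read with fread()
dt1 <- data.table(
  Variant  = c("Variant1", "Variant2", "Variant3", "Variant4"),
  Chrom    = c(2L, 1L, 8L, 1L),
  Position = c(14000L, 9000L, 37000L, 21000L)
)
dt2 <- data.table(
  Variant  = c("Variant1", "Variant2", "Variant3"),
  Chrom    = c(1L, 1L, 8L),
  Position = c(10000L, 20000L, 30000L)
)

# Window of +/- 5000 around every file2 position
dt2[, c("low", "high") := .(Position - 5000L, Position + 5000L)]

# Non-equi update join: tag each dt1 row that falls inside a window
# on the same chromosome (column names are case-sensitive)
dt1[dt2, match := i.Variant, on = .(Chrom, Position > low, Position < high)]

# Keep the rows that were never tagged, then drop the helper column
kept <- dt1[is.na(match)][, match := NULL][]
print(kept)
```

Because the comparisons are strict (> and <), a variant sitting exactly 5000 away from a file2 position is kept. If such variants should be dropped, >= and <= would be the change, assuming that reading of "more than 5000".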

The main reason is that your position and chrom references did not point to existing columns: the columns are named Position and Chrom, and R is case-sensitive.
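
As for test1 and test2 coming out identical, a likely explanation (a sketch under the assumption that both joins were run on the same dt1 object, as in the question) is that := updates dt1 by reference. The match column written by the first join is still there when the second join runs, and the reversed condition can never be true, since no position is both below low and above high. The second join therefore changes nothing. The names same and dropped below are illustrative only.

```r
library(data.table)

dt1 <- data.table(
  Variant  = c("Variant1", "Variant2", "Variant3", "Variant4"),
  Chrom    = c(2L, 1L, 8L, 1L),
  Position = c(14000L, 9000L, 37000L, 21000L)
)
dt2 <- data.table(
  Variant  = c("Variant1", "Variant2", "Variant3"),
  Chrom    = c(1L, 1L, 8L),
  Position = c(10000L, 20000L, 30000L)
)
dt2[, c("low", "high") := .(Position - 5000L, Position + 5000L)]

# First join: adds 'match' to dt1 itself (update by reference)
dt1[dt2, match := i.Variant, on = .(Chrom, Position > low, Position < high)]
test1 <- dt1[is.na(match)][, match := NULL][]

# Second join: the condition is impossible, so no row is updated and
# the 'match' values from the first join are still in dt1
dt1[dt2, match := i.Variant, on = .(Chrom, Position < low, Position > high)]
test2 <- dt1[is.na(match)][, match := NULL][]

same <- identical(test1$Variant, test2$Variant)

# The complement of test1 is the set of rows that DID match:
# reset the helper column, rerun the original join, keep non-NA rows
dt1[, match := NA_character_]
dt1[dt2, match := i.Variant, on = .(Chrom, Position > low, Position < high)]
dropped <- dt1[!is.na(match)]
print(dropped)
```

So the opposite set is obtained by negating is.na(match), not by reversing the operators in the join.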

WrehBah
