How to extract rows from a file based on conditions from another file in R

Question

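I have 2 genetic datasets. I am filtering file1 based on two columns in file2. A file1 row should be extracted only if its chromosome position is 5000 or more above, or 5000 or more below, every position of the variants on the same chromosome in file2. So the columns used for the condition are chromosome position (Position) and chromosome (Chrom). For example, my data looks like: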

File 1:

Variant      Chrom         Position  
Variant1      2            14000     
Variant2      1            9000              
Variant3      8            37000          
Variant4      1            21000

File 2:

Variant      Chrom         Position  
Variant1      1            10000                   
Variant2      1            20000                   
Variant3      8            30000

Expected output (variants whose position is more than +/-5000 away from every file 2 row on the same chromosome):

Variant     Chromosome        Position
Variant1       2               14000
Variant3       8               37000

#Variant1 at 14000 is within 5000 of Variant1 at 10000 in file2, but that variant is on a different chromosome, so it is not compared and Variant1 is kept.
#Variant3 is on the same chromosome as Variant3 in file2 but more than 5000 away from it, so it is kept.
#Every other file1 variant is within +/-5000 of a variant on the same chromosome in file2, so it is not kept.

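I coded this based on the answer to a previous question (How to select lines of a file based on multiple conditions of another file in R?). However, with my example data this code finds only 1 of the 2 expected variants. I also tried a proof-of-concept test to check that the code runs correctly, and that does not seem right either.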

Here is the code:

library(data.table)
dt1<-fread("file1.txt")  
dt2<-fread("file2.txt")   

dt2[, c("low", "high") := .(position - 5000, position  + 5000)]

#find matches on chromosome, with position between low&high
dt1[ dt2, match := i.Variant,
     on = .(chrom, position > low, position < high ) ]

#discard all found matches (match != NA ), and then drop the match-column
df <- dt1[ is.na(match) ][, match := NULL ][]   
fwrite(df, "file3.csv")

This currently outputs only:

    Variant Chrom Position
1: Variant1     2    14000

To check the code further, I tried reversing the > and < in the join to get the opposite dataset for comparison:

dt1[ dt2, match := i.Variant,
     on = .(Chrom, Position > low, Position < high ) ]
test1 <- dt1[ is.na(match) ][, match := NULL ][]

dt1[ dt2, match := i.Variant,
     on = .(Chrom, Position < low, Position > high ) ]
test2 <-  dt1[ is.na(match) ][, match := NULL ][]

When I inspect both my actual dataset and the example here, test1 and test2 output the same matches and non-matches. Is there something I am missing?

Answer 1

If you change your code to the following, it gives this result:

    Variant Chrom Position
1: Variant1     2    14000
2: Variant3     8    37000

Code:

library(data.table)
dt1 <- fread("file1.txt")
dt2 <- fread("file2.txt")

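# Window of +/-5000 around each file2 position (column names are case-sensitive)
dt2[, c("low", "high") := .(Position - 5000, Position + 5000)]
# Non-equi update join: tag file1 rows that fall inside a file2 window on the same chromosome
dt1[ dt2, match := i.Variant, on = .(Chrom, Position > low, Position < high)]
# Keep only the untagged rows, then drop the helper column
df <- dt1[ is.na(match) ][, match := NULL ][]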

fwrite(df, "file3.csv")
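
For a version that runs without the input files, here is a minimal sketch that rebuilds the question's example tables inline. The data.table() constructors are an assumption standing in for file1.txt and file2.txt, and the stopifnot() check is an added illustration rather than part of the original code.

```r
library(data.table)

# Inline copies of the question's example tables (assumed stand-ins for file1.txt and file2.txt)
dt1 <- data.table(Variant  = paste0("Variant", 1:4),
                  Chrom    = c(2L, 1L, 8L, 1L),
                  Position = c(14000, 9000, 37000, 21000))
dt2 <- data.table(Variant  = paste0("Variant", 1:3),
                  Chrom    = c(1L, 1L, 8L),
                  Position = c(10000, 20000, 30000))

# Column names are case-sensitive, so confirm the join columns exist before using them
stopifnot(all(c("Chrom", "Position") %in% names(dt1)),
          all(c("Chrom", "Position") %in% names(dt2)))

dt2[, c("low", "high") := .(Position - 5000, Position + 5000)]
dt1[dt2, match := i.Variant, on = .(Chrom, Position > low, Position < high)]
df <- dt1[is.na(match)][, match := NULL][]
print(df)  # Variant1 (Chrom 2, 14000) and Variant3 (Chrom 8, 37000) remain
```

Because the comparisons are strict, a file1 position exactly 5000 away from a file2 position counts as outside the window and is kept.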

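Most likely the cause is that position and chrom in your code do not refer to existing columns: column names are case-sensitive, and your data uses Position and Chrom.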
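
As for test1 and test2 coming out identical, one probable explanation is the following. It is an inference from how data.table update joins behave, not something confirmed in the question. Since low is always smaller than high, no position can satisfy Position < low and Position > high at the same time, so the reversed join matches nothing. An update join only writes to matched rows, so the match column left on dt1 by the first join stays as it was. The sketch below checks this on the example data, with the tables again built inline as stand-ins for the files.

```r
library(data.table)

# Inline copies of the question's example tables (assumed stand-ins for the files)
dt1 <- data.table(Variant  = paste0("Variant", 1:4),
                  Chrom    = c(2L, 1L, 8L, 1L),
                  Position = c(14000, 9000, 37000, 21000))
dt2 <- data.table(Variant  = paste0("Variant", 1:3),
                  Chrom    = c(1L, 1L, 8L),
                  Position = c(10000, 20000, 30000))
dt2[, c("low", "high") := .(Position - 5000, Position + 5000)]

# A position cannot be below low and above high at once, so this join finds nothing
n_reversed <- nrow(dt1[dt2, on = .(Chrom, Position < low, Position > high), nomatch = NULL])
print(n_reversed)  # 0

# First update join adds the match column to dt1 by reference
dt1[dt2, match := i.Variant, on = .(Chrom, Position > low, Position < high)]
before <- copy(dt1$match)

# Reversed update join: no rows match, so the existing match column is left untouched
dt1[dt2, match := i.Variant, on = .(Chrom, Position < low, Position > high)]
print(identical(before, dt1$match))  # TRUE
```

If this is what happened, clearing the column with dt1[, match := NULL] before the second join, or running each test on copy(dt1), would keep the two tests independent.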

巴哈

Solution 1
2 2020-02-07 14:30:45
