
Apply a function over one data.table column using data from another data.table

I am new to data.table and am trying to switch over. I have two data.tables (variable_sites and dt_bam) and want to use variable_sites$POS (call this refPOS) to run a function on data from dt_bam. To get the variable read_base in the summary table, I want to find the rows in dt_bam where refPOS lies between pos and pos + qwidth, and extract one character from the string dt_bam$seq based on the difference between refPOS and pos.

I have it working for a single value of refPOS, but I don't know how to sapply over a vector of refPOS values in data.table syntax. Any help is appreciated.

Here is my code:

library(data.table)

# lst is the list of BAM fields; it is defined elsewhere and not shown here
dt_bam <- data.table(qname = lst[[1]], rname = lst[[2]], strand = lst[[3]], pos = lst[[4]],
                     qwidth = lst[[5]], cigar = lst[[6]], seq = as.character(lst[[7]]))
refPOS <- 1000140 # renamed POS so as not to confuse it with pos
# keep reads that span refPOS and take the base at offset refPOS - pos
summ_tab <- dt_bam[refPOS < pos + qwidth & refPOS > pos,
                   .(locus_pos = refPOS, read_base = substr(seq, abs(refPOS - pos), abs(refPOS - pos)))]

# sapply(variable_sites[, POS], ) so that each value of variable_sites$POS becomes refPOS

Expected output, as below, but with one row for every row in variable_sites[, POS]:

    refPOS read_base
1: 1000140         C

Here is some sample data:

> head(variable_sites)
    CHR     POS REF
1: chr1 1013855   G
2: chr1 1045080   G
3: chr1 1051873   C
4: chr1 1083795   C
5: chr1 1091327   C
6: chr1 1091421   T    

> head(dt_bam)
                qname rname strand     pos qwidth  cigar
1: SRR709972.27609810  chr1      + 1000135    101   101M
2: SRR709972.27609810  chr1      - 1000145    101   101M
3: SRR709972.23678227  chr1      + 1000545    101 91M10S
4: SRR709972.23678227  chr1      - 1000632    101   101M
5: SRR709972.11643848  chr1      + 1000651    101   101M
6: SRR709972.18299955  chr1      + 1000669    101   101M
                                                                                                     seq
1: GCCGCGGGGTGTGTGAACCCGGCTCCGCATTCTTTCCCACACTCGCCCCAGCCAATCGACGGCCGCGCTCCTCCCCCGCTCGCTGTCAGTCACGCCTCGGC
2: GTGTGAACCCGGCTCCGCATTCTTTCCCACACTCGCCCCAGCCAATCGACGGCCGCGCTCCTCCCCCGCTCGCTGTCAGTCACGCCTCGGCTCCGGGCGCG
3: CGAGCCTCGGTCTCGAGCCTCTTGGCTTCCTCCGCCCTTCCCCACTCCGGTCCCGGTTTGGGCCCTGCTCTGTCTCCGAGTTTGATCCGACCCCGCCTCGC
4: CGACACCGGCTCGGCCTCCGGGGGTCCCCCCCTCAGGTGTGCGGCCTGGAGCACGGAGGGCTGCAGAAAGCCTTGGGAGCGACAGAGCCGGGGGAAGGTTG
5: GGGGGTCCCACCCTCAGGTGTGCGGCCTGGAGCACGGAGGGCTGCAGAAAGCCTTGGGAGCGACAGAGCCGGGGGAAGGTTGGCGGCCGGGTCGGCAGGCG
6: TGTGCGGCCTGGAGCACGGAGGGCTGCAGAAAGCCTTGGGAGCGACAGAGCCGGGGGAAGGTTGGCTGCCGGGTCGGCAGGCGGGAGGGCGGAGTCAGCGG

> dput(head(variable_sites))
setDT(structure(list(CHR = c("chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1"), POS = c(1013855L, 1045080L, 1051873L, 1083795L, 1091327L, 
1091421L), REF = c("G", "G", "C", "C", "C", "T")), row.names = c(NA, 
-6L), class = c("data.table", "data.frame")))

This is the data.table approach you are looking for. We create a temporary variable end in dt_bam and then perform a non-equi join. Note that when performing the join, you must use x.POS to refer to variable_sites$POS; plain POS will give you the wrong variable. i.pos, pos and POS all refer to dt_bam$pos because, by default, the variable you are joining on (POS in this case) is replaced by the first corresponding variable (pos in this case) from the data.table being joined with.

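library(data.table)

# Temporary column marking the end of each read (added to dt_bam by reference)
dt_bam[, end := pos + qwidth]

# Non-equi update join: for each site with pos < POS < end, store the read base
variable_sites[
  dt_bam,
  read_base := substr(seq, x.POS - i.pos, x.POS - i.pos),
  on = .(POS > pos, POS < end)
]

# Drop the temporary column again
dt_bam[, end := NULL]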

Output

> variable_sites
    CHR     POS REF read_base
1: chr1 1013855   G      <NA>
2: chr1 1045080   G      <NA>
3: chr1 1051873   C      <NA>
4: chr1 1083795   C      <NA>
5: chr1 1091327   C      <NA>
6: chr1 1091421   T      <NA>
7: chr1 1000140   ?         C
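
One caveat when several reads cover the same site: an update join with := can store only one read_base per row of variable_sites. The following is a minimal sketch, not part of the original answer, of a join in the other direction that returns one row per (site, read) overlap. The tables reads and sites and the read names r1 and r2 are hypothetical toy data made up for illustration. The offset POS - pos follows the convention used in this post; whether a + 1 is needed depends on whether pos is 0-based or 1-based, which the post does not state.

```r
library(data.table)

# Hypothetical toy data: two overlapping reads and two sites
reads <- data.table(
  qname  = c("r1", "r2"),
  pos    = c(10L, 12L),
  qwidth = c(5L, 5L),
  seq    = c("ABCDE", "FGHIJ")
)
sites <- data.table(POS = c(13L, 30L))

# Temporary end-of-read column, as in the approach above
reads[, end := pos + qwidth]

# Join sites onto reads: keep every read with pos < POS < end.
# x.pos is the read's start; i.POS is the site position.
per_read <- reads[
  sites,
  .(refPOS = i.POS, qname, read_base = substr(seq, i.POS - x.pos, i.POS - x.pos)),
  on = .(pos < POS, end > POS),
  nomatch = NULL # drop sites that no read covers
]

reads[, end := NULL]
print(per_read)
```

Here site 13 is covered by both reads, so per_read has two rows for it, while site 30 is covered by none and is dropped. Leaving out nomatch = NULL would keep uncovered sites with NA in the read columns.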

Data

variable_sites <- data.table::setDT(structure(list(CHR = c("chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1"), POS = c(1013855L, 1045080L, 1051873L, 1083795L, 
1091327L, 1091421L, 1000140L), REF = c("G", "G", "C", "C", "C", 
"T", "?")), row.names = c(NA, -7L), class = c("data.table", "data.frame")))

dt_bam <- data.table::setDT(structure(list(qname = c("SRR709972.27609810", "SRR709972.27609810", 
"SRR709972.23678227", "SRR709972.23678227", "SRR709972.11643848", 
"SRR709972.18299955"), rname = c("chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1"), strand = c("+", "-", "+", "-", "+", "+"), pos = c(1000135L, 
1000145L, 1000545L, 1000632L, 1000651L, 1000669L), qwidth = c(101L, 
101L, 101L, 101L, 101L, 101L), cigar = c("101M", "101M", "91M10S", 
"101M", "101M", "101M"), seq = c("GCCGCGGGGTGTGTGAACCCGGCTCCGCATTCTTTCCCACACTCGCCCCAGCCAATCGACGGCCGCGCTCCTCCCCCGCTCGCTGTCAGTCACGCCTCGGC", 
"GTGTGAACCCGGCTCCGCATTCTTTCCCACACTCGCCCCAGCCAATCGACGGCCGCGCTCCTCCCCCGCTCGCTGTCAGTCACGCCTCGGCTCCGGGCGCG", 
"CGAGCCTCGGTCTCGAGCCTCTTGGCTTCCTCCGCCCTTCCCCACTCCGGTCCCGGTTTGGGCCCTGCTCTGTCTCCGAGTTTGATCCGACCCCGCCTCGC", 
"CGACACCGGCTCGGCCTCCGGGGGTCCCCCCCTCAGGTGTGCGGCCTGGAGCACGGAGGGCTGCAGAAAGCCTTGGGAGCGACAGAGCCGGGGGAAGGTTG", 
"GGGGGTCCCACCCTCAGGTGTGCGGCCTGGAGCACGGAGGGCTGCAGAAAGCCTTGGGAGCGACAGAGCCGGGGGAAGGTTGGCGGCCGGGTCGGCAGGCG", 
"TGTGCGGCCTGGAGCACGGAGGGCTGCAGAAAGCCTTGGGAGCGACAGAGCCGGGGGAAGGTTGGCTGCCGGGTCGGCAGGCGGGAGGGCGGAGTCAGCGG"
)), row.names = c(NA, -6L), class = c("data.table", "data.frame")))
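
For comparison, the loop that the question describes (running the single-position code once per value of POS) can be sketched with lapply and rbindlist. This is only an illustration under assumptions: reads, sites and the helper base_at are hypothetical names and toy data, not from the original post. A non-equi join is usually preferable on large tables because it avoids scanning all reads once per site.

```r
library(data.table)

# Hypothetical toy data: two reads and two sites
reads <- data.table(
  pos    = c(10L, 12L),
  qwidth = c(5L, 5L),
  seq    = c("ABCDE", "FGHIJ")
)
sites <- data.table(POS = c(13L, 30L))

# Hypothetical helper: the question's single-position query as a function
base_at <- function(refPOS) {
  reads[
    refPOS < pos + qwidth & refPOS > pos,
    .(refPOS    = rep(refPOS, .N), # rep() keeps the column empty when no read matches
      read_base = substr(seq, refPOS - pos, refPOS - pos))
  ]
}

# One small table per site, stacked into a single summary table
summ_tab <- rbindlist(lapply(sites$POS, base_at))
print(summ_tab)
```

lapply plus rbindlist is used instead of sapply because each call returns a table with a variable number of rows, which sapply would not simplify cleanly.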
