
How can I extract rows from above and below a specific row in an R dataframe?

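I am currently working with some Fastq sequencing data. I have a data frame with three columns and a few hundred rows. The first column contains the raw sequencing reads, and the other columns contain information about those reads. I want to return every row that contains the string "FALSE" in the third column, plus the row directly above it and the two rows directly below it. I think this is similar to grep -A -B in the shell.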

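I have looked around, and my question is similar to this one:

Returning above and below rows of specific rows in r dataframe

However, the answers there are based on row names rather than on a string in the row. My row names are just sequential numbers.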

    Fastq Output    BARCODE     Dulplicated
1   ReadName1       NA          NA
2   ReadSeq1        TGTG TTAT   FALSE
3   +               NA          NA
4   Ascii_score1    NA          NA
5   ReadName2       NA          NA
6   ReadSeq2        TGCT TTAT   FALSE
7   +               NA          NA
8   Ascii_score2    NA          NA
9   ReadName3       NA          NA
10  ReadSeq3        TGCT TTAT   TRUE
11  +               NA          NA
12  Ascii_score3    NA          NA

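If the Dulplicated column holds character values, you can do the following (the comparison also works for a logical column, because R coerces FALSE to "FALSE"):

inds <- which(df$Dulplicated == "FALSE")                    # which() drops NA
rows <- sort(unique(c(inds - 1, inds, inds + 1, inds + 2))) # one above, two below
df[rows[rows >= 1 & rows <= nrow(df)], ]                    # skip out-of-range rows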

#   FastqOutput  BARCODE Dulplicated
#1    ReadName1     <NA>          NA
#2     ReadSeq1 TGTGTTAT       FALSE
#3            +     <NA>          NA
#4 Ascii_score1     <NA>          NA
#5    ReadName2     <NA>          NA
#6     ReadSeq2 TGCTTTAT       FALSE
#7            +     <NA>          NA
#8 Ascii_score2     <NA>          NA
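
For reuse, the same index arithmetic can be wrapped in a small helper that behaves roughly like grep -B/-A. This is only a sketch: the function name rows_around and its arguments before and after are made up for illustration, and the toy data frame uses placeholder values rather than the real data. It clips the window to the table so that a match near the first or last row does not produce all-NA rows.

```r
# Hypothetical helper (not from the original answer): return every matching
# row plus `before` rows above and `after` rows below each match.
rows_around <- function(df, hits, before = 1, after = 2) {
  idx <- which(hits)                                 # which() drops NA
  rows <- unlist(lapply(idx, function(i) (i - before):(i + after)))
  rows <- sort(unique(as.integer(rows)))             # merge overlapping windows
  rows <- rows[rows >= 1 & rows <= nrow(df)]         # clip to valid row numbers
  df[rows, , drop = FALSE]
}

# Toy data shaped like the question's table (values are placeholders)
toy <- data.frame(
  FastqOutput = c("ReadName1", "ReadSeq1", "+", "Ascii_score1",
                  "ReadName2", "ReadSeq2", "+", "Ascii_score2"),
  Dulplicated = c(NA, TRUE, NA, NA, NA, FALSE, NA, NA),
  stringsAsFactors = FALSE
)

res <- rows_around(toy, toy$Dulplicated == "FALSE")
```

Changing before and after adjusts the window, for example rows_around(toy, toy$Dulplicated == "FALSE", before = 0, after = 3).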

Or, similarly, using dplyr::slice:

library(dplyr)
df %>% slice(sort(unique(c(inds, inds - 1, inds + 1, inds + 2))))

Data

df <- structure(list(FastqOutput = structure(c(5L, 8L, 1L, 2L, 6L, 
9L, 1L, 3L, 7L, 10L, 1L, 4L), .Label = c("+", "Ascii_score1", 
"Ascii_score2", "Ascii_score3", "ReadName1", "ReadName2", "ReadName3", 
"ReadSeq1", "ReadSeq2", "ReadSeq3"), class = "factor"), BARCODE = 
structure(c(NA, 2L, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, NA), .Label = c("TGCTTTAT", 
"TGTGTTAT"), class = "factor"), Dulplicated = c(NA, FALSE, NA, 
NA, NA, FALSE, NA, NA, NA, TRUE, NA, NA)), class = "data.frame", 
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

We can use data.table:

library(data.table)
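setDT(df)[df[, {i1 <- which(!as.logical(Dulplicated))       # rows where Dulplicated is FALSE
             i2 <- sort(unique(c(outer(i1, -1:2, `+`))))   # one row above to two rows below
             i2[i2 >= 1 & i2 <= .N] }]]                    # keep valid row numbers only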
#    FastqOutput  BARCODE Dulplicated
#1:    ReadName1     <NA>          NA
#2:     ReadSeq1 TGTGTTAT       FALSE
#3:            +     <NA>          NA
#4: Ascii_score1     <NA>          NA
#5:    ReadName2     <NA>          NA
#6:     ReadSeq2 TGCTTTAT       FALSE
#7:            +     <NA>          NA
#8: Ascii_score2     <NA>          NA

Or it can be written more compactly (n = -1:2 covers one row above through two rows below):

setDT(df)[df[, Reduce(`|`, shift(!as.logical(Dulplicated), n = -1:2))]]
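
To make the shift-and-OR idea concrete without data.table, here is a minimal base R sketch. shift_lgl is a hypothetical helper written for this illustration, not a data.table function: a positive n moves values down (a lag), a negative n moves them up (a lead), and vacated positions are filled with FALSE. The dup vector is placeholder data and is assumed to be logical.

```r
# Hypothetical helper: shift a logical vector by n positions, padding with FALSE.
# n > 0 is a lag (values move down), n < 0 is a lead (values move up).
shift_lgl <- function(x, n) {
  len <- length(x)
  k <- min(abs(n), len)
  pad <- rep(FALSE, k)
  if (n >= 0) c(pad, x[seq_len(len - k)]) else c(x[seq_len(len - k) + k], pad)
}

dup <- c(NA, TRUE, NA, NA, NA, FALSE, NA, NA)   # placeholder values
hit <- !is.na(dup) & !dup                       # TRUE only where dup is FALSE

# -1 marks the row above each hit, 0 the hit itself, 1 and 2 the rows below
keep <- Reduce(`|`, lapply(-1:2, function(n) shift_lgl(hit, n)))
which(keep)
```

The resulting keep vector can be used directly as a logical row index, for example df[keep, ] on a data frame with the same number of rows.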

