
How can I extract rows from above and below a specific row in an R dataframe?

I'm currently working with some FASTQ sequencing data. I have a data frame with three columns and hundreds of rows. The first column contains the raw sequencing reads, and the other two contain information about those reads. I want to return every row with the string "FALSE" in the third column, plus the row directly above it and the two rows directly below it. I think this is similar to grep with -A and -B in the shell (here, grep -B 1 -A 2).

I've looked around, and my question is very similar to this one:

Returning above and below rows of specific rows in r dataframe

However, the answers there are based on row names, not on strings within the rows. My row names are just consecutive numbers.

    Fastq Output    BARCODE     Dulplicated
1   ReadName1       NA          NA
2   ReadSeq1        TGTG TTAT   FALSE
3   +               NA          NA
4   Ascii_score1    NA          NA
5   ReadName2       NA          NA
6   ReadSeq2        TGCT TTAT   FALSE
7   +               NA          NA
8   Ascii_score2    NA          NA
9   ReadName3       NA          NA
10  ReadSeq3        TGCT TTAT   TRUE
11  +               NA          NA
12  Ascii_score3    NA          NA

If the Dulplicated column holds character values, you can find the matching rows with which() and then add the offsets -1, +1 and +2:

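inds <- which(df$Dulplicated == "FALSE")   # row numbers of the matches (which() drops NA)
rows <- sort(unique(c(inds, inds - 1, inds + 1, inds + 2)))   # each match, 1 row above, 2 rows below
df[rows[rows >= 1 & rows <= nrow(df)], ]   # drop positions that fall outside the data frame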

#   FastqOutput  BARCODE Dulplicated
#1    ReadName1     <NA>          NA
#2     ReadSeq1 TGTGTTAT       FALSE
#3            +     <NA>          NA
#4 Ascii_score1     <NA>          NA
#5    ReadName2     <NA>          NA
#6     ReadSeq2 TGCTTTAT       FALSE
#7            +     <NA>          NA
#8 Ascii_score2     <NA>          NA
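
The same idea can be wrapped in a small reusable function, which may help when the number of context rows varies. This is only a sketch: context_rows, before and after are hypothetical names that do not come from the answer, and toy is a reduced stand-in for the question's data. It pairs every match with every offset via outer() and clips the result to valid row numbers, so a match on the first or last rows does not produce NA rows.

```r
# Hypothetical helper (not part of the original answer): indices of the
# matching rows plus `before` rows above and `after` rows below each match,
# roughly like `grep -B before -A after`.
context_rows <- function(hit, before = 1L, after = 2L) {
  idx <- which(hit)                                 # which() drops NA, so NA never counts as a match
  if (length(idx) == 0L) return(integer(0))
  all_idx <- outer(idx, seq(-before, after), `+`)   # every match combined with every offset
  all_idx <- sort(unique(as.vector(all_idx)))
  all_idx[all_idx >= 1L & all_idx <= length(hit)]   # clip to valid row numbers
}

# Reduced stand-in for the question's data (assumed layout: logical column,
# NA on every line that is not a sequence line)
dup <- c(NA, FALSE, NA, NA, NA, FALSE, NA, NA, NA, TRUE, NA, NA)
toy <- data.frame(id = seq_along(dup), Dulplicated = dup)

rows <- context_rows(toy$Dulplicated == "FALSE")
toy[rows, ]
```

Because the offsets are parameters, context_rows(hit, before = 0, after = 3) would return each match and the three rows below it.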

Or similarly, using dplyr::slice:

library(dplyr)
df %>% slice(sort(unique(c(inds, inds - 1, inds + 1, inds + 2))))
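
A possible alternative within dplyr avoids row numbers altogether and filters on shifted copies of a logical flag. This is a sketch under assumptions, not the answerer's method: toy is a reduced stand-in for the data, and hit is a hypothetical helper column. lead() looks at the row below and lag() at the rows above, so a row is kept when it is a match, sits directly above one, or sits one or two rows below one.

```r
library(dplyr)

# Reduced stand-in for the question's data (assumed layout)
dup <- c(NA, FALSE, NA, NA, NA, FALSE, NA, NA, NA, TRUE, NA, NA)
toy <- data.frame(id = seq_along(dup), Dulplicated = dup)

res <- toy %>%
  mutate(hit = Dulplicated %in% FALSE) %>%   # TRUE only where the column is FALSE; NA becomes FALSE
  filter(
    hit |                                    # the matching row itself
      lead(hit, 1, default = FALSE) |        # the row directly above a match
      lag(hit, 1, default = FALSE) |         # the first row below a match
      lag(hit, 2, default = FALSE)           # the second row below a match
  ) %>%
  select(-hit)                               # drop the helper column again

res
```

Setting default = FALSE keeps the first and last rows from turning into NA when there is nothing to shift in.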

data

df <- structure(list(FastqOutput = structure(c(5L, 8L, 1L, 2L, 6L, 
9L, 1L, 3L, 7L, 10L, 1L, 4L), .Label = c("+", "Ascii_score1", 
"Ascii_score2", "Ascii_score3", "ReadName1", "ReadName2", "ReadName3", 
"ReadSeq1", "ReadSeq2", "ReadSeq3"), class = "factor"), BARCODE = 
structure(c(NA, 2L, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, NA), .Label = c("TGCTTTAT", 
"TGTGTTAT"), class = "factor"), Dulplicated = c(NA, FALSE, NA, 
NA, NA, FALSE, NA, NA, NA, TRUE, NA, NA)), class = "data.frame", 
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

We can use data.table:

library(data.table)
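setDT(df)[df[, {i1 <- .I[which(!as.logical(Dulplicated))]   # rows where Dulplicated is FALSE
             i2 <- sort(unique(c(outer(i1, -1:2, `+`))))    # 1 row above to 2 rows below each match
             i2[i2 >= 1 & i2 <= .N] }]]                     # keep only valid row numbers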
#    FastqOutput  BARCODE Dulplicated
#1:    ReadName1     <NA>          NA
#2:     ReadSeq1 TGTGTTAT       FALSE
#3:            +     <NA>          NA
#4: Ascii_score1     <NA>          NA
#5:    ReadName2     <NA>          NA
#6:     ReadSeq2 TGCTTTAT       FALSE
#7:            +     <NA>          NA
#8: Ascii_score2     <NA>          NA

Or it can be written more compactly:

setDT(df)[df[, Reduce(`|`, shift(!as.logical(Dulplicated), n = -1:2))]]
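
To see what the compact form is doing, the mask logic can be spelled out in base R. This is an illustrative sketch only: shift_down and shift_up are hypothetical helpers standing in for data.table's shift(), they assume n is smaller than the vector length, and dup is a reduced copy of the Dulplicated column. Each shifted copy of the match flag marks one kind of neighbour, and Reduce with `|` keeps a row if any copy is TRUE.

```r
# Hypothetical base-R stand-ins for shifting a logical vector (assume n < length(x))
shift_down <- function(x, n) c(rep(FALSE, n), head(x, -n))   # row r receives the value from row r - n
shift_up   <- function(x, n) c(tail(x, -n), rep(FALSE, n))   # row r receives the value from row r + n

# Reduced copy of the Dulplicated column (assumed layout)
dup <- c(NA, FALSE, NA, NA, NA, FALSE, NA, NA, NA, TRUE, NA, NA)
hit <- !is.na(dup) & !dup          # TRUE exactly where the column is FALSE

masks <- list(
  shift_up(hit, 1),     # marks the row above each match
  hit,                  # the match itself
  shift_down(hit, 1),   # first row below each match
  shift_down(hit, 2)    # second row below each match
)
keep <- Reduce(`|`, masks)          # a row is kept if any mask is TRUE
which(keep)
```

The mask approach never produces out-of-range indices, since every shifted vector has the same length as the original.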

