How can I extract rows from above and below a specific row in an R dataframe?
I'm currently working with some Fastq sequencing data. I have a dataframe with three columns and hundreds of rows. The first column contains the raw sequencing reads and the others contain information about those reads. I want to return each row with the string "FALSE" in the 3rd column, plus the row directly above it and the two rows directly below it. I think this is similar to grep -A -B in the shell.
I've looked around and my question is very similar to this one:
Returning above and below rows of specific rows in r dataframe
However, the answers there are based on row names and not on strings within the rows. My row names are just numbers in numerical order.
   FastqOutput   BARCODE   Dulplicated
1  ReadName1     NA        NA
2  ReadSeq1      TGTGTTAT  FALSE
3  +             NA        NA
4  Ascii_score1  NA        NA
5  ReadName2     NA        NA
6  ReadSeq2      TGCTTTAT  FALSE
7  +             NA        NA
8  Ascii_score2  NA        NA
9  ReadName3     NA        NA
10 ReadSeq3      TGCTTTAT  TRUE
11 +             NA        NA
12 Ascii_score3  NA        NA
If the Dulplicated column has character values, you can do
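# Row positions where Dulplicated is "FALSE" (which() drops the NAs)
inds <- which(df$Dulplicated == "FALSE")
# Each hit, the row above it, and the two rows below it, without repeats
df[sort(unique(c(inds, inds - 1, inds + 1, inds + 2))), ]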
# FastqOutput BARCODE Dulplicated
#1 ReadName1 <NA> NA
#2 ReadSeq1 TGTGTTAT FALSE
#3 + <NA> NA
#4 Ascii_score1 <NA> NA
#5 ReadName2 <NA> NA
#6 ReadSeq2 TGCTTTAT FALSE
#7 + <NA> NA
#8 Ascii_score2 <NA> NA
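The same idea can be wrapped in a small helper that mimics grep -B/-A and also drops positions that fall outside the data frame (otherwise inds + 2 near the last row would return rows of NA). This is only a sketch: rows_around, before and after are names made up for this illustration, and the toy data frame is assumed, not taken from the question.

```r
# Hypothetical helper: return each matching row plus `before` rows above
# and `after` rows below it, similar to grep -B/-A.
rows_around <- function(data, hits, before = 1, after = 2) {
  # hits: integer row positions of the matches
  idx <- as.integer(unlist(lapply(hits, function(i) (i - before):(i + after))))
  # remove repeats, keep only positions that exist, restore row order
  idx <- sort(unique(idx[idx >= 1 & idx <= nrow(data)]))
  data[idx, , drop = FALSE]
}

# Toy data (assumed for the example only); the last hit sits on the final row
toy <- data.frame(
  read = paste0("r", 1:8),
  Dulplicated = c(NA, FALSE, NA, NA, NA, NA, NA, FALSE)
)
res <- rows_around(toy, which(toy$Dulplicated == "FALSE"))
rownames(res)
```

With the dataframe from the question, the call would be rows_around(df, which(df$Dulplicated == "FALSE")); changing before and after adjusts the window.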
Or similarly using dplyr::slice
library(dplyr)
df %>% slice(sort(unique(c(inds, inds - 1, inds + 1, inds + 2))))
data
df <- structure(list(FastqOutput = structure(c(5L, 8L, 1L, 2L, 6L,
9L, 1L, 3L, 7L, 10L, 1L, 4L), .Label = c("+", "Ascii_score1",
"Ascii_score2", "Ascii_score3", "ReadName1", "ReadName2", "ReadName3",
"ReadSeq1", "ReadSeq2", "ReadSeq3"), class = "factor"), BARCODE =
structure(c(NA, 2L, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, NA), .Label = c("TGCTTTAT",
"TGTGTTAT"), class = "factor"), Dulplicated = c(NA, FALSE, NA,
NA, NA, FALSE, NA, NA, NA, TRUE, NA, NA)), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
We can use data.table
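library(data.table)
setDT(df)[df[, {i1 <- .I[which(!as.logical(Dulplicated))]
  # outer() pairs every hit with every offset: one row above to two rows below
  idx <- sort(unique(c(outer(i1, -1:2, `+`))))
  idx[idx >= 1 & idx <= .N]}]]  # drop positions outside the table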
# FastqOutput BARCODE Dulplicated
#1: ReadName1 <NA> NA
#2: ReadSeq1 TGTGTTAT FALSE
#3: + <NA> NA
#4: Ascii_score1 <NA> NA
#5: ReadName2 <NA> NA
#6: ReadSeq2 TGCTTTAT FALSE
#7: + <NA> NA
#8: Ascii_score2 <NA> NA
Or it can be written more compactly (n = -1:2 covers one row above and two rows below each hit)
setDT(df)[df[, Reduce(`|`, shift(!as.logical(Dulplicated), n = -1:2))]]
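To see what the shift() version computes, here is a rough base R equivalent that builds the same kind of logical mask, assuming the window from the question (one row above, two rows below). The helper lag_by and the toy vector are hypothetical and written only for this illustration.

```r
# Hypothetical helper: shift a logical vector by k positions.
# k > 0 moves values down (lag), k < 0 moves them up (lead); gaps become FALSE.
lag_by <- function(x, k) {
  n <- length(x)
  out <- rep(FALSE, n)
  if (abs(k) >= n) {
    return(out)
  }
  if (k >= 0) {
    out[(k + 1):n] <- x[1:(n - k)]
  } else {
    out[1:(n + k)] <- x[(1 - k):n]
  }
  out
}

dup <- c(NA, FALSE, NA, NA, NA, NA, NA, TRUE)  # toy values, assumed
hit <- !is.na(dup) & !dup                      # TRUE only where the value is FALSE

# Keep a row if a hit is on it (k = 0), on the row below it (k = -1),
# or on one of the two rows above it (k = 1, 2).
keep <- Reduce(`|`, lapply(-1:2, function(k) lag_by(hit, k)))
which(keep)
```

Subsetting with the mask, for example df[keep, ] on a data frame of matching length, returns the rows in their original order without needing sort() or unique().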