How to find three consecutive rows with the same value

I have a data frame as follows:

chr     leftPos    Sample1  X.DD   3_samples    MyStuff
1        324         -1        1        1           1
1        4565        -1        0        0           0 
1        6887        -1        1        0           0
1        12098        1       -1        1           1
2        12          -1        1        0           1
2        43          -1        1        1           1
5        1           -1        1        1           0
5        43           0        1       -1           0
5        6554         1        1        1           1
5        7654        -1        0        0           0
5        8765         1        1        1           0
5        9833         1        1        1          -1
6        12           1        1        0           0
6        43           0        0        0           0
6        56           1        0        0           0
6        79           1        0       -1           0
6        767          1        0       -1           0
6        3233         1        0       -1           0

I would like to convert it according to the following rules, applied separately for each chromosome:

a. If there are three or more consecutive 1's or -1's in a column, the values stay as they are.

b. If there are fewer than three consecutive 1's or -1's in a column, those 1's or -1's change to 0.

Rows in a column must have the same sign (positive or negative) to count as consecutive.

The result for the data frame above should be:

chr     leftPos    Sample1  X.DD   3_samples    MyStuff
1        324         -1        0        0           0
1        4565        -1        0        0           0
1        6887        -1        0        0           0
1        12098        0        0        0           0
2        12           0        0        0           0
2        43           0        0        0           0
5        1            0        1        0           0
5        43           0        1        0           0
5        6554         0        1        0           0
5        7654         0        0        0           0
5        8765         0        0        0           0
5        9833         0        0        0           0
6        12           0        0        0           0
6        43           0        0        0           0
6        56           1        0        0           0
6        79           1        0       -1           0
6        767          1        0       -1           0
6        3233         1        0       -1           0

I have managed to do this for two consecutive rows, but I'm not sure how to change it for three or more rows.

DAT_list2res <-cbind(DAT_list2[1:2],DAT_list2res)
colnames(DAT_list2res)[1:2]<-c("chr","leftPos")
DAT_list2res$chr<-as.numeric(gsub("chr","",DAT_list2res$chr))
DAT_list2res<-as.data.frame(DAT_list2res)
dx<-DAT_list2res
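# For one column, return the row indices that are NOT part of a pair of
# equal, consecutive 1/-1 values within the same chromosome.
f0 <- function(colNr, dx)
{
  col <- dx[, colNr]
  n1  <- which(col == 1 | col == -1)   # Rows holding 1 or -1.
  d0  <- which(diff(col) == 0)         # Row i equals row i + 1 in this column.
  dc0 <- which(diff(dx[, 1]) == 0)     # Row i and row i + 1 share a chromosome.
  m   <- intersect(n1 - 1, intersect(d0, dc0))
  return(setdiff(1:nrow(dx), union(m, m + 1)))
}
# Apply f0 to every value column (3 onwards) and set the returned rows to 0.
g <- function(dx)
{
  for (i in 3:ncol(dx)) { dx[f0(i, dx), i] <- 0 }
  return(dx)
}
dx <- g(dx)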

Here is one solution using only base R.

First, define a function that replaces every run shorter than three elements with zeros:

replace_f <- function(x){
  subs <- rle(x)
  subs$values[subs$lengths < 3] <- 0
  inverse.rle(subs)
}
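
To see why replace_f works, it can help to look at what rle returns for a short vector. The sketch below uses a made-up vector x (not taken from the question's data) and repeats the definition of replace_f so that it runs on its own:

```r
# Hypothetical example vector, not taken from the question's data.
x <- c(-1, -1, -1, 1, 1, 0, 1, 1, 1, 1)

runs <- rle(x)
runs$lengths  # 3 2 1 4
runs$values   # -1 1 0 1

replace_f <- function(x) {
  subs <- rle(x)
  subs$values[subs$lengths < 3] <- 0  # zero out every run shorter than 3
  inverse.rle(subs)
}

res <- replace_f(x)
print(res)  # -1 -1 -1 0 0 0 1 1 1 1
```

Because rle only merges identical values, a run of 1's followed by a run of -1's counts as two separate runs, which matches the same-sign requirement in the question.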

Then split your data.frame by chr and apply the function to all columns that you want to change (in this case columns 3 to 6):

df[,3:6] <- do.call("rbind", lapply(split(df[,3:6], df$chr), function(x) apply(x, 2, replace_f)))

Notice that we combine the results with rbind before replacing the original data. This gives the desired result:

   chr leftPos Sample1 X.DD X3_samples MyStuff
1    1     324      -1    0          0       0
2    1    4565      -1    0          0       0
3    1    6887      -1    0          0       0
4    1   12098       0    0          0       0
5    2      12       0    0          0       0
6    2      43       0    0          0       0
7    5       1       0    1          0       0
8    5      43       0    1          0       0
9    5    6554       0    1          0       0
10   5    7654       0    0          0       0
11   5    8765       0    0          0       0
12   5    9833       0    0          0       0
13   6      12       0    0          0       0
14   6      43       0    0          0       0
15   6      56       1    0          0       0
16   6      79       1    0         -1       0
17   6     767       1    0         -1       0
18   6    3233       1    0         -1       0
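
One caveat: split orders the groups by the sorted values of chr, so the rbind approach relies on the rows already being sorted by chromosome, as they are in the question. A possible alternative is ave, which applies a function within each group and returns the results in the original row order. This is only a sketch, using a small made-up data frame with hypothetical column names (pos, s1, s2):

```r
replace_f <- function(x) {
  subs <- rle(x)
  subs$values[subs$lengths < 3] <- 0
  inverse.rle(subs)
}

# Hypothetical data: chr is deliberately not sorted.
df <- data.frame(
  chr = c(2, 2, 2, 1, 1, 1, 1),
  pos = c(10, 20, 30, 5, 15, 25, 35),
  s1  = c(1, 1, 1, -1, -1, 0, 1),
  s2  = c(1, 1, 0, 0, -1, -1, -1)
)

cols <- c("s1", "s2")
# ave() runs replace_f separately per chromosome and keeps row order.
df[cols] <- lapply(df[cols], function(v) ave(v, df$chr, FUN = replace_f))
print(df)
```

Here s1 keeps its three 1's on chromosome 2, and s2 keeps its three -1's on chromosome 1. Everything else becomes 0.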

A data.table solution using rleid would be:

require(data.table)
setDT(dat)
dat[,Sample1 := Sample1 * as.integer(.N>=3), by=.(chr, rleid(Sample1))]

This groups by rleid(Sample1) and uses data.table's .N variable, which holds the number of rows in the current group (here, the length of the run).
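
For readers unfamiliar with rleid: it assigns an increasing id to each run of identical values, so grouping by it yields one group per run. A rough base R equivalent for a single vector without NAs, shown only to illustrate the idea (the helper name run_id and the vector x are made up here), might look like this:

```r
# Hypothetical helper mimicking data.table::rleid for one NA-free vector.
run_id <- function(x) {
  if (length(x) == 0L) return(integer(0))
  # A new run starts at position 1 and wherever the value changes.
  cumsum(c(TRUE, x[-1L] != x[-length(x)]))
}

x   <- c(-1, -1, -1, 1, 0, 0, 1, 1, 1)
ids <- run_id(x)
print(ids)  # 1 1 1 2 3 3 4 4 4

# Length of the run each element belongs to (the role .N plays per group).
n   <- ave(x, ids, FUN = length)
res <- x * (n >= 3)
print(res)  # -1 -1 -1 0 0 0 1 1 1
```

Multiplying by the logical n >= 3 keeps values in long runs and turns the rest into 0, which is the same trick as Sample1 * as.integer(.N>=3).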

To do this for all columns, you could use the eval(parse(text=...)) syntax as follows:

for(i in names(dat)[3:6]){
  by_string = paste0("list(chr, rleid(", i, "))")
  def_string = paste0(i, "* as.integer(.N>=3)")
  dat[,(i) := eval(parse(text=def_string)), by=eval(parse(text=by_string))]
}

This results in:

> dat[]
    chr leftPos Sample1 X.DD X3_samples MyStuff
 1:   1     324      -1    0          0       0
 2:   1    4565      -1    0          0       0
 3:   1    6887      -1    0          0       0
 4:   1   12098       0    0          0       0
 5:   2      12       0    0          0       0
 6:   2      43       0    0          0       0
 7:   5       1       0    1          0       0
 8:   5      43       0    1          0       0
 9:   5    6554       0    1          0       0
10:   5    7654       0    0          0       0
11:   5    8765       0    0          0       0
12:   5    9833       0    0          0       0
13:   6      12       0    0          0       0
14:   6      43       0    0          0       0
15:   6      56       1    0          0       0
16:   6      79       1    0         -1       0
17:   6     767       1    0         -1       0
18:   6    3233       1    0         -1       0
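
If eval(parse(...)) feels heavy, one possible alternative is to group by chr only and apply a run-length helper to several columns at once via .SD and .SDcols. This is a sketch on a small made-up table, not the answer's own method. keep_long_runs and the column names pos, s1 and s2 are hypothetical, and the data.table package is assumed to be installed:

```r
library(data.table)

# Hypothetical helper: zero out runs shorter than min_len, keeping the type.
keep_long_runs <- function(x, min_len = 3L) {
  r <- rle(x)
  x * rep(r$lengths >= min_len, r$lengths)
}

# Made-up example data, not the table from the question.
dat <- data.table(
  chr = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  pos = c(5L, 15L, 25L, 35L, 10L, 20L, 30L),
  s1  = c(-1L, -1L, -1L, 1L, 1L, 1L, 0L),
  s2  = c(0L, 1L, 1L, 1L, 1L, -1L, -1L)
)

cols <- c("s1", "s2")
# Within each chromosome, rewrite every listed column by reference.
dat[, (cols) := lapply(.SD, keep_long_runs), by = chr, .SDcols = cols]
print(dat)
```

Grouping by chr alone is enough here because the helper finds the runs itself with rle, so no per-column by expression has to be built from strings.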
