Finding sequences in rows in R based on the rep function on a certain column


I'm trying to find a sequence of 0's in a row based on the rep function of a certain column. Below is my best attempt so far, which throws an error. I tried using an apply loop but failed miserably, and I don't really want to use a for loop unless I have to, as my true dataset is about 800,000 rows. I have tried looking up solutions but can't seem to find anything, and I have spent a few hours on this with no luck. I have also attached the desired output.

library(data.table)

TEST_DF <- data.table(INDEX = c(1,2,3,4),
                      COL_1 = c(0,0,0,0),
                      COL_2 = c(0,0,2,5),
                      COL_3 = c(0,0,0,0),
                      COL_4 = c(0,2,0,1),
                      DAYS  = c(4,4,2,2))

IN_FUN <- function(x, y)
{
  x <- rle(x)

  if( max(as.numeric(x$lengths[x$values == 0])) >= y )
  {
    "Y"
  }
  else
  {
    "N"
  }
}

TEST_DF$DEFINITION <- apply(TEST_DF[, c(2:5), with = FALSE], 1, 
                            FUN = IN_FUN(TEST_DF[, c(2:5), with = FALSE], TEST_DF$DAYS))

DESIRED <- data.table(P_ID = c(1,2,3,4),
                      COL_1 = c(0,0,0,0),
                      COL_2 = c(0,0,2,5),
                      COL_3 = c(0,0,0,0),
                      COL_4 = c(0,2,0,1),
                      DAYS  = c(4,4,2,2),
                      DEFINITION = c("Y","N","Y","N"),
                      INDEX = c(2,NA,4,NA))

For the first row I want to see if four 0's are within COL_1 to COL_4, likewise four 0's for row 2, and two 0's for rows 3 and 4. Basically, the number of 0's is given by the value in the DAYS column. So since four 0's are within row 1, DEFINITION gets a value of "Y"; row 2 gets a value of "N" since there are only three 0's; row 3 should get a value of "Y" since it has two 0's in sequence; etc.

Also, if possible, if the DEFINITION column has a value of "Y", it should return the column index of the first occurrence of the desired sequence. E.g. in row 1, the first 0 of the four 0's we're looking for is in COL_1, so we should get a value of 2 for the INDEX column; row 2 gets an NA since DEFINITION is "N", etc.

Feel free to make any edits to make it clearer for other users, and let me know if you need better information.

Cheers in advance :)

EDIT:
Below is a slightly extended data table. Let me know if this is sufficient.

TEST_DF <- data.table(P_ID = c(1,2,3,4,5,6,7,8,10),
                  COL_1 = c(0,0,0,0,0,0,0,5,90),
                  COL_2 = c(0,0,0,0,0,0,3,78,6),
                  COL_3 = c(0,0,0,0,0,0,7,5,0),
                  COL_4 = c(0,0,0,0,0,5,0,2,0),
                  COL_5 = c(0,0,0,0,0,7,2,0,0),
                  COL_6 = c(0,0,0,0,0,9,0,0,5),
                  COL_7 = c(0,0,0,0,0,1,0,0,6),
                  COL_8 = c(0,0,0,0,0,0,0,1,8),
                  COL_9 = c(0,0,0,0,0,1,6,1,0),
                  COL_10 = c(0,0,0,0,0,0,7,1,0),
                  COL_11 = c(0,0,0,0,0,0,8,3,0),
                  COL_12 = c(0,0,0,0,0,0,9,6,7),
                  DAYS = c(10,8,12,4,5,4,3,4,7))

Where the DEFINITION column for the rows would be c(1,1,1,1,1,0,1,0,0), where 1 is "Y" and 0 is "N". Either is ok.

For the INDEX column in the new edit, the values should be c(2,2,2,2,2,NA,7,NA,NA).
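
For reference, the rule described above can be sketched in base R. This is only a minimal sketch, not a tuned solution for 800,000 rows: it rebuilds the extended table as a plain data.frame so that it runs on its own, and first_zero_run is a hypothetical helper name, not something from data.table.

```r
# Extended test data from the edit, rebuilt as a plain data.frame (assumption:
# data.table is not needed for this illustration)
TEST_DF <- data.frame(P_ID = c(1,2,3,4,5,6,7,8,10),
                      COL_1 = c(0,0,0,0,0,0,0,5,90),
                      COL_2 = c(0,0,0,0,0,0,3,78,6),
                      COL_3 = c(0,0,0,0,0,0,7,5,0),
                      COL_4 = c(0,0,0,0,0,5,0,2,0),
                      COL_5 = c(0,0,0,0,0,7,2,0,0),
                      COL_6 = c(0,0,0,0,0,9,0,0,5),
                      COL_7 = c(0,0,0,0,0,1,0,0,6),
                      COL_8 = c(0,0,0,0,0,0,0,1,8),
                      COL_9 = c(0,0,0,0,0,1,6,1,0),
                      COL_10 = c(0,0,0,0,0,0,7,1,0),
                      COL_11 = c(0,0,0,0,0,0,8,3,0),
                      COL_12 = c(0,0,0,0,0,0,9,6,7),
                      DAYS = c(10,8,12,4,5,4,3,4,7))

# Hypothetical helper: position within x where the first run of at least
# n consecutive zeros starts, or NA if there is no such run
first_zero_run <- function(x, n) {
  r <- rle(x == 0)                             # runs of zero / non-zero
  starts <- cumsum(r$lengths) - r$lengths + 1  # start position of every run
  hit <- which(r$values & r$lengths >= n)      # zero runs that are long enough
  if (length(hit)) starts[hit[1]] else NA_real_
}

vals <- as.matrix(TEST_DF[, paste0("COL_", 1:12)])
start <- vapply(seq_len(nrow(vals)),
                function(i) first_zero_run(vals[i, ], TEST_DF$DAYS[i]),
                numeric(1))

DEFINITION <- as.integer(!is.na(start))  # 1 is "Y", 0 is "N"
INDEX <- start + 1                       # +1 because COL_1 is the 2nd column
```

On this table, DEFINITION and INDEX come out as the two vectors listed above.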

I think I understand this better now that the question has been edited. This uses loops, so it might not be optimal speed-wise, but the set statement should help with this. It still gets some of the speed-up that data.table provides.

# Combine all column values into one comma-delimited string per row.
# The leading and trailing commas delimit every value, so the pattern below
# cannot match the tail of a value such as 90.
TEST_DF[ , COL_STRING := paste(COL_1,COL_2,COL_3,COL_4,COL_5,COL_6,COL_7,COL_8,COL_9,COL_10,COL_11,COL_12,sep=",")]
TEST_DF[ , COL_STRING := paste0(",",COL_STRING,",")]

# Using the DAYS variable, create the string to be searched for
for (i in 1:nrow(TEST_DF))
  set(TEST_DF,i=i,j="FIND",value=paste0(",",paste(rep("0,",TEST_DF[i]$DAYS),collapse="")))

# Find the character position where the pattern starts. A value of -1 means it does not exist
for (i in 1:nrow(TEST_DF))
  set(TEST_DF,i=i,j="INDEX",value=regexpr(TEST_DF[i]$FIND,TEST_DF[i]$COL_STRING,fixed=TRUE)[1])

# Define DEFINITION
TEST_DF[ , DEFINITION := 1L*(INDEX != -1L)]

# Convert the character position into a column count by counting the commas
# up to and including the start of the match (0 when there is no match)
library(stringr)
for (i in 1:nrow(TEST_DF))
  set(TEST_DF,i=i,j="INDEX",value=str_count(substr(TEST_DF[i]$COL_STRING,1,TEST_DF[i]$INDEX),","))

# Shift to the column index in TEST_DF (COL_1 is column 2); non-matches become NA
TEST_DF[ , INDEX := INDEX + DEFINITION]
TEST_DF[INDEX==0L, INDEX := NA_integer_]
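
To make the string search concrete, here is a small base-R sketch on a single row (the COL_1 to COL_12 values of the P_ID 7 row from the question). It assumes the row string is wrapped in commas on both sides, so that the pattern can only match whole values; without the leading comma, "0,0," would also match inside "90,0,". The variable names are illustrative only.

```r
vals <- c(0,3,7,0,2,0,0,0,6,7,8,9)   # COL_1..COL_12 for P_ID 7
days <- 3

# Wrap the row in commas so every value is delimited on both sides
col_string <- paste0(",", paste(vals, collapse = ","), ",")
find       <- paste0(",", strrep("0,", days))

# Character position of the first match, or -1 if the pattern is absent
pos <- regexpr(find, col_string, fixed = TRUE)[1]

# Commas up to and including the match start = columns before the run + 1
n_commas <- nchar(gsub("[^,]", "", substr(col_string, 1, pos)))

# One more because COL_1 is the 2nd column of the table
index <- if (pos == -1) NA_integer_ else n_commas + 1L

# Why the delimiters matter: an undelimited pattern matches inside 90
unsafe <- regexpr("0,0,",  "90,0,7,",  fixed = TRUE)[1]
safe   <- regexpr(",0,0,", ",90,0,7,", fixed = TRUE)[1]
```

Here index comes out as 7, the expected INDEX for that row, while unsafe is 2 (a false match) and safe is -1.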

I was able to do this with some math trickery. I created a binary matrix where an element is 1 if it was originally 0, and 0 otherwise. Then, for each row, I set the nth element equal to ((n-1)th element + nth element) times the nth element. In this transformed matrix, the value of an element equals the number of consecutive elements, up to and including it, that were 0.

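# Binary matrix of the COL_* values: 1 where the original value was 0, 0 otherwise.
# Original 1s are moved to 2 first so they are not mistaken for zeros.
m<-as.matrix(TEST_DF[, 2:(ncol(TEST_DF)-1L), with=FALSE])
m[m==1]<-2
m[m==0]<-1
m[m!=1]<-0

# Running count of consecutive zeros along each row
for(i in 2:ncol(m)){
  m[,i]=(m[,i-1]+m[,i])*m[,i]
}

# note the use of with=FALSE -- this forces ncol to be evaluated
#   outside of TEST_DF, leading the result to be used as a
#   column number instead of just evaluating to a scalar
m<-as.matrix(cbind(m, Days=TEST_DF[,ncol(TEST_DF),with=FALSE]))
# First column where the running count reaches DAYS (the end of the run), NA if none
indx<-apply(m[,-ncol(m)] >= m[,ncol(m)],1,function(x) match(TRUE,x) )

TEST_DF$DEFINITION<-ifelse(is.na(indx),0,1)
# Start of the run = end - DAYS + 1, plus 1 because COL_1 is column 2
TEST_DF$INDEX<-indx-TEST_DF$DAYS+2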
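
The recurrence can be checked by hand on one row. A minimal sketch, using the COL_1 to COL_12 values of the P_ID 7 row with DAYS = 3 (the variable names are illustrative only):

```r
x    <- c(0,3,7,0,2,0,0,0,6,7,8,9)   # COL_1..COL_12 for P_ID 7
days <- 3

# 1 where the original value was 0, 0 otherwise
b <- as.numeric(x == 0)

# Running count of consecutive zeros: resets to 0 at every non-zero value
for (i in 2:length(b)) {
  b[i] <- (b[i - 1] + b[i]) * b[i]
}

# First position where the count reaches days (the end of the run)
end <- match(TRUE, b >= days)

# Start of the run, shifted by 1 because COL_1 is the 2nd column
index <- end - days + 2
```

After the loop b is 1 0 0 1 0 1 2 3 0 0 0 0, so end is 8 and index is 7.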

Note: I stole some stuff from this post

You might explore the IRanges package. I just defined the test dataset as a data.frame, since I am not familiar with data.table. I then expanded it to your dataset size of 800,000 rows:

TEST_DF <- TEST_DF[sample(nrow(TEST_DF), 800000, replace=TRUE),]

Then, we put IRanges to work:

library(IRanges)
m <- t(as.matrix(TEST_DF[,2:13]))
l <- relist(Rle(m), PartitioningByWidth(rep(nrow(m), ncol(m))))
r <- ranges(l)
validRuns <- width(r) >= TEST_DF$DAYS
TEST_DF$DEFINITION <- sum(validRuns) > 0
TEST_DF$INDEX <- drop(phead(start(r)[validRuns], 1)) + 1L

The first step simplifies the table to a matrix, so we can transpose it and get things in the right layout for a light-weight partitioning (PartitioningByWidth) of the data into a type of list. The data are converted into a run-length encoding (Rle) along the way, which finds the runs of zeros in each row. We can extract the ranges representing the runs and then compute on them more efficiently than we might on the split Rle directly. We find the runs that meet or exceed DAYS and record which groups (rows) have at least one such run. Finally, we find the start of the valid runs, take the first start for each group with phead, and drop so that those with no runs become NA.

For 800,000 rows, this takes about 4 seconds. If that's not fast enough, we can work on optimization.
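
Why the transpose and the partitioning matter can be seen in a small base-R sketch. It uses rle and split as stand-ins for Rle and PartitioningByWidth (an assumption for illustration, not the IRanges API), on two made-up rows:

```r
rows <- rbind(c(5,0,0),
              c(0,0,7))          # two toy rows (made-up data)

m    <- t(rows)                  # transpose: each original row is now a column
flat <- as.vector(m)             # column-major, so each row stays contiguous

# Without partitioning, the zeros of both rows merge into one run of 4
merged <- rle(flat)$lengths

# Partition the flat vector back into one group per original row
groups <- rep(seq_len(ncol(m)), each = nrow(m))
runs   <- lapply(split(flat, groups), rle)
```

Here merged is 1 4 1, while the per-row run lengths are 1 2 for the first row and 2 1 for the second, so no run spills across a row boundary.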
