
R: finding values inside text

In R, I would like to find how many of the values in the vectors AA, BB, and CC appear in the Text column, expressed as a proportion of each vector's length. The length and content of the Text column vary from row to row. (Note: R reads this column in as a factor with 1 level.)

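# Sample code (sketch of the intended logic; Text is the text column)

AA <- c(540, 300, 330)
BB <- c(400)
CC <- c(530, 310)

Text <- as.character(Text)  # the column is read in as a factor
A <- B <- C <- numeric(length(Text))

for (i in seq_along(Text)) {
  # Share of each vector's values that occur in row i of Text
  A[i] <- sum(sapply(as.character(AA), grepl, x = Text[i], fixed = TRUE)) / length(AA)
  B[i] <- sum(sapply(as.character(BB), grepl, x = Text[i], fixed = TRUE)) / length(BB)
  C[i] <- sum(sapply(as.character(CC), grepl, x = Text[i], fixed = TRUE)) / length(CC)
}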

Desired output:

Day          A   B   C
01-Jan-14    1   0   0.5
02-Jan-15    0   1   0

Source file:

  Day         Text
  01-Jan-14   The number 540, 300, 330. The day is 530
  02-Jan-15   The day is 400

I'm sure there is a simpler solution, but here's an option:

1) Make the vectors into a named list.

vectorList <- list(A = AA, B = BB, C = CC)

2) Write a function that takes a data frame with Day and Text columns, a vector of numbers, and an output column name as a string. It returns a data frame with Day and, for each day, the proportion of the vector's values found in the Text column.

    check <- function(df, vector, colName) {
        z <- NULL
        for (i in unique(vector)) {
            for (j in unique(df$Day)) {
                # Rows belonging to this day
                one <- subset(df, Day == j)
                # TRUE for each row whose Text contains the value i
                x <- grepl(as.character(i), one$Text)
                # Contribution of this value to the day's proportion
                y <- sum(x) / length(vector)
                z <- rbind(z, data.frame(Day = j,
                                         Value = i,
                                         Prop = y,
                                         stringsAsFactors = FALSE))
            }
        }
        # Sum the per-value contributions within each day
        a <- aggregate(z$Prop,
                       by = list(Day = z$Day),
                       FUN = sum)
        colnames(a)[2] <- colName
        a
    }
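To see the core computation in isolation, here is a minimal sketch for a single text string. The helper name prop_found is my own (it is not part of the function above), and it assumes you want whole-number matches: plain grepl does substring matching, so a value like 40 would also be found inside 540, which the \\b word boundaries prevent.

```r
# Sample inputs taken from the question's data
text <- "The number 540, 300, 330. The day is 530."
AA <- c(540, 300, 330)
CC <- c(530, 310)

# Hypothetical helper: proportion of `values` that occur in `text`.
# \\b anchors each number at word boundaries, so 40 does not match inside 540.
prop_found <- function(values, text) {
  patterns <- paste0("\\b", values, "\\b")
  found <- vapply(patterns, grepl, logical(1), x = text, perl = TRUE)
  sum(found) / length(values)
}

prop_found(AA, text)  # 1
prop_found(CC, text)  # 0.5
prop_found(40, text)  # 0
```

Whether substring or whole-number matching is right depends on your data; for the sample in the question both give the same result.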

3) Use lapply to run the function on each element of the vector list. This returns a list of data frames. The names of the list elements become the names of the final columns (e.g. column "A" for the AA vector).

dfList <- lapply(seq_along(vectorList), function(i) {
   # Use the list name ("A", "B", "C") as the output column name
   check(df, vector = vectorList[[i]], colName = names(vectorList)[i])
})

4) Reduce the list of data frames into a single data frame. merge joins on the shared Day column.

output <- Reduce(merge, dfList)
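For comparison, the four steps above can probably be condensed into one pass. This is only a sketch under some assumptions: there is one row per Day (as in the sample data), Text is a character column, and plain substring matching is acceptable. The helper name prop_by_row is hypothetical.

```r
df <- data.frame(
  Day = c("01-Jan-14", "02-Jan-15"),
  Text = c("The number 540, 300, 330. The day is 530.", "The day is 400"),
  stringsAsFactors = FALSE
)
vectorList <- list(A = c(540, 300, 330), B = 400, C = c(530, 310))

# Hypothetical helper: for each string in `text`, the proportion of
# `values` that occur in it (fixed substring match, no regex).
prop_by_row <- function(values, text) {
  vapply(text, function(s) {
    mean(vapply(as.character(values), grepl, logical(1), x = s, fixed = TRUE))
  }, numeric(1), USE.NAMES = FALSE)
}

# One proportion column per named vector, next to the Day column
output <- data.frame(Day = df$Day,
                     lapply(vectorList, prop_by_row, text = df$Text))
output
```

On the sample data this gives A = 1, B = 0, C = 0.5 for 01-Jan-14 and A = 0, B = 1, C = 0 for 02-Jan-15, which matches the desired output in the question. If a day can span several rows, you would still need an aggregation step like the one in check.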

Hope that helps!


Data:

df <- data.frame(Day = c("01-Jan-14", "02-Jan-15"),
             Text = c("The number 540, 300, 330. The day is 530.", 
                      "The day is 400"))

AA <- c(540, 300, 330)
BB <- 400
CC <- c(530, 310)
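The question mentions that Text is read in as a factor. Since R 4.0.0, data.frame() no longer converts strings to factors by default, so the data above may already be character on your machine. If it is not, a small sketch like the following converts the factor columns first. The name df_factor is just for illustration.

```r
# Build the sample data with factors on purpose, to mimic the question
df_factor <- data.frame(
  Day = c("01-Jan-14", "02-Jan-15"),
  Text = c("The number 540, 300, 330. The day is 530.", "The day is 400"),
  stringsAsFactors = TRUE
)

# Convert every factor column to character, leaving other columns untouched
df_factor[] <- lapply(df_factor, function(col) {
  if (is.factor(col)) as.character(col) else col
})

str(df_factor)
```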
