
R finding values inside text

In R, I would like to count how many of the values in the vectors AA, BB, and CC appear in each entry of the Text column, expressed as a fraction of each vector's length. The length and content of the Text column vary. (Note: R reads this column in as a factor with 1 level.)

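Sample code (pseudocode, not valid R):

AA = c(540, 300, 330)
BB = c(400)
CC = c(530, 310)

for (i in 1:length(Text))
{
  # "X in Text[i]" means: some value of X appears in the text.
  # The checks are independent, since one text can match several vectors.
  if (AA in Text[i]) {
    A[i] = NumberOfAFound / length(AA)
  }
  if (BB in Text[i]) {
    B[i] = NumberOfBFound / length(BB)
  }
  if (CC in Text[i]) {
    C[i] = NumberOfCFound / length(CC)
  }
}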

Desired Output:

Day          A   B   C
01-Jan-14    1   0   0.5
02-Jan-15    0   1   0

Source File:

  Day         Text
  01-Jan-14   The number 540, 300, 330.
              The day is 530
  02-Jan-15   The day is 400

I'm sure there is a simpler solution, but here's an option:

1) Make the vectors into a named list.

vectorList <- list(A = AA, B = BB, C = CC)

2) Write a function that takes a data frame with Day and Text columns, a vector of numbers, and a column name (as a string). It returns a data frame with Day and the proportion of the vector's values that were found in the Text column for that day.

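check <- function(df, vector, colName) {
    z <- NULL
    for (i in unique(vector)) {          # each value to look for
        for (j in unique(df$Day)) {      # each day in the data
            one <- subset(df, Day == j)
            # TRUE for every row of this day whose Text contains the value
            x <- sapply(one$Text, function(txt) grepl(as.character(i), txt))
            # contribution of this value to the day's proportion
            y <- sum(x) / length(vector)
            z <- rbind(z, data.frame(Day = j,
                                     Value = i,
                                     Prop = y,
                                     stringsAsFactors = FALSE))
        }
    }
    # add up the per-value contributions for each day
    a <- aggregate(z$Prop,
                   by = list(Day = z$Day),
                   FUN = sum)
    colnames(a)[2] <- colName
    a
}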

3) Use lapply to run the function on each element of the vector list. This returns a list of data frames, and the names of the list elements become the names of the final columns (e.g. column "A" for the AA vector).

dfList <- lapply(seq_along(vectorList), function(i) {
   colName <- names(vectorList)[[i]]   # "A", "B" or "C"
   vector <- vectorList[[i]]
   check(df, vector = vector, colName = colName)
})

4) Reduce the list of dataframes into a single dataframe.

output <- Reduce(merge, dfList)
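
As an illustration of this last step: `merge` joins two data frames on the column names they share (here `Day`), and `Reduce` applies it pairwise across the list. A small self-contained sketch with made-up per-vector results, whose values simply mirror the desired output in the question:

```r
# Toy per-vector results, shaped like the data frames returned by check().
dfs <- list(
  data.frame(Day = c("01-Jan-14", "02-Jan-15"), A = c(1, 0)),
  data.frame(Day = c("01-Jan-14", "02-Jan-15"), B = c(0, 1)),
  data.frame(Day = c("01-Jan-14", "02-Jan-15"), C = c(0.5, 0))
)

# Equivalent to merge(merge(dfs[[1]], dfs[[2]]), dfs[[3]]); each merge joins on "Day".
combined <- Reduce(merge, dfs)
print(combined)
```

The result has one row per day and the columns Day, A, B and C.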

Hope that helps!
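
One caveat with `grepl` is that it matches substrings, so a search for 540 would also succeed on a text containing 5400. A more compact alternative might look like the sketch below. It assumes the values are whole numbers, that each value counts at most once per row, and that there is one row per day. The helper name `prop_found` is hypothetical and not part of the steps above.

```r
# Hypothetical helper: for each text, the share of `values` that appear in it
# as complete numbers (runs of digits).
prop_found <- function(text, values) {
  text <- as.character(text)   # also covers a factor column
  # Extract every run of digits from each string (one list element per string).
  found <- regmatches(text, gregexpr("[0-9]+", text))
  # Share of `values` present among the numbers extracted from each string.
  vapply(found, function(nums) mean(values %in% as.numeric(nums)), numeric(1))
}

days  <- c("01-Jan-14", "02-Jan-15")
texts <- c("The number 540, 300, 330. The day is 530.", "The day is 400")
vectorList <- list(A = c(540, 300, 330), B = c(400), C = c(530, 310))

# One column per named vector, next to the Day column.
result <- data.frame(Day = days,
                     lapply(vectorList, function(v) prop_found(texts, v)))
print(result)
```

With the sample data this gives A = 1, 0; B = 0, 1; C = 0.5, 0, which matches the desired output. Unlike `check`, it works row by row and does not aggregate several rows for the same day.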


Data:

df <- data.frame(Day = c("01-Jan-14", "02-Jan-15"),
                 Text = c("The number 540, 300, 330. The day is 530.",
                          "The day is 400"))

AA <- as.vector(c(540,300,330))
BB <- as.vector(400)
CC <- as.vector(c(530,310))
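
The question mentions that the Text column is read in as a factor, and before R 4.0.0 `data.frame()` also created factors from strings by default. `grepl` coerces its input to character, so the code above should still work, but converting explicitly avoids surprises. A minimal sketch, with `stringsAsFactors = TRUE` set only to reproduce the factor case:

```r
# Build the sample data with Text stored as a factor, as described in the question.
df <- data.frame(Day  = c("01-Jan-14", "02-Jan-15"),
                 Text = c("The number 540, 300, 330. The day is 530.",
                          "The day is 400"),
                 stringsAsFactors = TRUE)
is.factor(df$Text)                 # TRUE

# Convert to plain character strings before searching.
df$Text <- as.character(df$Text)
has_540 <- grepl("540", df$Text, fixed = TRUE)
has_540                            # TRUE FALSE
```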
