简体   繁体   中英

apply a function to each element of a column individually

Let us assume following as my data table data

data <- setDT(structure(list(col1 = c(1, 2, 3, 4, 5), col2 = c(53, 45, 54, 
97, 23), col3 = c("aa aa aa aa ab ad af ae ar", "bb bb bb bb bt by bu bi bo", 
"cc cc cc cc cd cy ch cn cd", "dd dd dd dd dt dy dj dk da", "ee ee ee ee et eh es er eg"
), col4 = c("aa bb ff ff","aa ff vv rr","dd dd rr gg",
"yy yy rr rr","uu uu uu ee")), .Names = c("col1", "col2", "col3", "col4"), 
row.names = c(NA, -5L), class = "data.frame"))

col1     col2    col3                            col4
1        53      aa aa aa aa ab ad af ae ar      aa bb ff ff
2        45      bb bb bb bb bt by bu bi bo      aa ff vv rr
3        54      cc cc cc cc cd cy ch cn cd      dd dd rr gg
4        97      dd dd dd dd dt dy dj dk da      yy yy rr rr
5        23      ee ee ee ee et eh es er eg      uu uu uu ee

col3 has strings of words and I need to find that if the most frequently occurred word in col3 appears in col4 or not. So output will look like as follows:

col1     col2    col3                            col4          most_freq_word_in_cool3       out_col
1        53      aa aa aa aa ab ad af ae ar      aa bb ff ff             aa                      1  
2        45      bb bb bb bb bt by bu bi bo      aa ff vv rr             bb                      0
3        54      cc cc cc cc cd cy ch cn cd      dd dd rr gg             cc                      0  
4        97      dd dd dd dd dt dy dj dk da      yy yy rr rr             dd                      0 
5        23      ee ee ee ee et eh es er eg      uu uu uu ee             ee                       1

I tried the following solution

    m_fre_word1 <- function(x) { string <- as.character(unlist(strsplit(x, " ")))
                                freq <- sort(table(string), decreasing = T)
                                wr <-names(freq)[1]
                                return(wr) }

    data <- data[ , most_freq_word_in_cool3:= apply(data[ , .(col3)], 1, m_fre_word1)]
    data <- data[ , out_col:= as.numeric(grepl(m_fre_word1(col3), col4))]

There is nothing wrong with this solution, but it is really slow. My data table is huge. I can't use this way so I am looking for a faster alternative. Could somebody suggest a faster alternative.

Thanks,

Here's an attempt. Instead of running this whole thing on each row, I would suggest to split the column and operate on a long format.

I'm stealing the Mode function from here and it is defined as follows

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
} 

Now using the newest data.table version we could do

library(data.table) # v 1.9.6+
temp <- setDT(data6)[, tstrsplit(col3, " ", fixed = TRUE)]
data6[, res := melt(temp[, indx := .I], id = "indx")[, Mode(value), by = indx]$V1]
data6

#    col1 col2                       col3        col4 res
# 1:    1   53 aa aa aa aa ab ad af ae ar aa bb ff ff  aa
# 2:    2   45 bb bb bb bb bt by bu bi bo aa ff vv rr  bb
# 3:    3   54 cc cc cc cc cd cy ch cn cd dd dd rr gg  cc
# 4:    4   97 dd dd dd dd dt dy dj dk da yy yy rr rr  dd
# 5:    5   23 ee ee ee ee et eh es er eg uu uu uu ee  ee

The second step can be easily achieved by eitehr

data6[, out := +grepl(res, col4, fixed = TRUE), by = res]

OR (not sure which one is faster)

library(stringi)
data6[stri_detect_fixed(col4, res), out := 1L]

As a side note, when using reference semantics, there is no need to copy the whole data set and reassign it using <- , in fact that's the whole point of reference semantics. Please read this .

I think the apply(data[ , .(col3)] is what is slowing down your code. Generally I have found putting a subset within a data.table call causes huge slowdowns, as the subset operation is expensive.

You can try:

DT[ , test := names(sort(table(strsplit(col3," ")), decreasing = T))[1], by = col1]
DT[, search := gsub(" ","|",col4)]
DT[, output := grepl(search,test), by = col1]

Not sure if my or David's answer will be faster.

EDIT: Based on Frank's input , the last two lines can be replaced with:

DT[, output := mapply(grepl,gsub(" ","|",col4),test)]

(Edited as per comments below)

Load the library

require(data.table)

Define the data

x <-
  data.table(
    col1 = c(1, 2, 3, 4, 5),
    col2 = c(53, 45, 54, 97, 23),
    col3 = c(
      "aa aa aa aa ab ad af ae ar", "bb bb bb bb bt by bu bi bo", 
      "cc cc cc cc cd cy ch cn cd", "dd dd dd dd dt dy dj dk da",
      "ee ee ee ee et eh es er eg"),
    col4 = c(
      "aa bb ff ff","aa ff vv rr","dd dd rr gg",
      "yy yy rr rr","uu uu uu ee")
         )

Find the most freqent element in col3

x[,most_freq_word_in_col3:=sapply(col3,function(e){
  names(sort(table(unlist(strsplit(e," "))),decreasing=TRUE)[1])})]

Check whether this element in col4

x[,out_col:=apply(cbind(most_freq_word_in_col3,col4),1,function(e){
  as.numeric(e[1] %in% unlist(strsplit(e[2]," ")))})]

Output:

> x
   col1 col2                       col3        col4 most_freq_word_in_col3 out_col
1:    1   53 aa aa aa aa ab ad af ae ar aa bb ff ff                     aa       1
2:    2   45 bb bb bb bb bt by bu bi bo aa ff vv rr                     bb       0
3:    3   54 cc cc cc cc cd cy ch cn cd dd dd rr gg                     cc       0
4:    4   97 dd dd dd dd dt dy dj dk da yy yy rr rr                     dd       0
5:    5   23 ee ee ee ee et eh es er eg uu uu uu ee                     ee       1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM