
Add a new column to a data.table created using assign in a loop

I have a data.frame keywordsCategory which contains a set of phrases that I would like to categorize depending on the words they contain.

For example, one of my "check terms" is test1, which corresponds to category cat1. As the first observation of my data.frame is This is a test1, I need to add a new column category containing the corresponding category.

Because one observation can be assigned to more than one category, I thought that the best option was to create independent subsets of my data.frame using grepl and later bind them all into a new data.frame.

library(data.table)

wordsToCheck <- c("test1", "test2", "This")
categoryToAssign <- c("cat1", "cat2", "cat3")

keywordsCategory <- data.frame(Keyword=c("This is a test1", "This is a test2"))

for (i in 1:length(wordsToCheck)) {
        myOriginal <- wordsToCheck[i]
        myCategory <- categoryToAssign[i]
        dfToCreate <- paste0("withCategory",i)
        assign(dfToCreate, 
               data.table(keywordsCategory[grepl(paste0(".*",myOriginal,".*"),
                                                 keywordsCategory$Keyword)==TRUE,]))
        # this wont work :(
        # dfToCreate[,category:=myCategory]
}

# Create a list with all newly created data.tables
l.df <- lapply(ls(pattern="withCategory[0-9]+"), function(x) get(x))

# Create an aggregated dataframe with all Keywords data.tables
newdf <- do.call("rbind", l.df)

The subset and rbind steps work, but I am not able to assign the corresponding category to my newly created data.tables. If I uncomment the line, I get the following error:

Error in `:=`(category, myCategory) : Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").

However, if I add the column manually once the loop is done, for instance:

withCategory1[,category:=myCategory]

It works correctly and the table output is as expected:

> withCategory1
                V1 category
1: This is a test1     cat2

tableOutput <- structure(list(V1 = structure(1L, .Label = c("This is a test1", 
"This is a test2"), class = "factor"), category = "cat2"), .Names = c("V1", 
"category"), row.names = c(NA, -1L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x00000000001f0788>)

What is the best/safest method to add a new column to a data.table when it is created inside a loop using the assign function? The solution doesn't need to use data.table, as I only use it because my real data has millions of observations and I thought data.table would be faster.
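On the error itself: inside the loop, dfToCreate holds the name of the table as a character string (for example "withCategory1"), not the data.table, so `:=` ends up being applied to a character vector and fails the is.data.table check. One way to sidestep this, shown here only as a minimal sketch, is to skip assign altogether, add the category column when each subset is built, and collect the subsets in a list. The name pieces is hypothetical, and fixed = TRUE assumes the check terms are literal strings rather than regular expressions:

```r
library(data.table)

wordsToCheck <- c("test1", "test2", "This")
categoryToAssign <- c("cat1", "cat2", "cat3")
keywordsCategory <- data.frame(Keyword = c("This is a test1", "This is a test2"),
                               stringsAsFactors = FALSE)

# Build one data.table per check term and keep them in a list (hypothetical
# name 'pieces'), adding the category column at construction time
pieces <- lapply(seq_along(wordsToCheck), function(i) {
  # assumption: check terms are literal strings, hence fixed = TRUE
  hits <- grepl(wordsToCheck[i], keywordsCategory$Keyword, fixed = TRUE)
  data.table(Keyword = keywordsCategory$Keyword[hits],
             category = rep(categoryToAssign[i], sum(hits)))
})

# Stack all pieces into a single data.table
newdf <- rbindlist(pieces)
```

Keeping the subsets in a list also removes the need for ls(pattern = ...) and get() afterwards, since rbindlist takes the list directly.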

As an alternative to your for-loop, you can use a combination of paste0, mapply and grepl to get what you want:

# create a 'data.table'
newDT <- as.data.table(keywordsCategory)

# assign the correct categories to each row
newDT[, category := paste0(categoryToAssign[mapply(grepl, wordsToCheck, Keyword)], collapse = ','), by = 1:nrow(newDT)]

which gives:

> newDT
           Keyword  category
1: This is a test1 cat1,cat3
2: This is a test2 cat2,cat3
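To see what the j expression does for each row, it may help to run it by hand on a single keyword. This is only an illustration; the names keyword, hits and matched are hypothetical and not part of the solution above:

```r
wordsToCheck <- c("test1", "test2", "This")
categoryToAssign <- c("cat1", "cat2", "cat3")

# One keyword, as seen inside a single by-group of the data.table call
keyword <- "This is a test1"

# mapply calls grepl once per check term, giving one TRUE/FALSE per term
hits <- mapply(grepl, wordsToCheck, keyword)

# Keep the categories whose term matched and collapse them into one string
matched <- paste0(categoryToAssign[hits], collapse = ",")
print(matched)  # "cat1,cat3"
```

Grouping by row number makes Keyword a length-one value inside j, which is why the result is one collapsed string per row.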

If you want to expand the category column so that each row holds a single category, see this Q&A for several ways to do that. For example, with cSplit from the splitstackshape package:

library(splitstackshape)
cSplit(newDT, 'category', ",", direction = 'long')

you get:

           Keyword category
1: This is a test1     cat1
2: This is a test1     cat3
3: This is a test2     cat2
4: This is a test2     cat3
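If you would rather not add another package, the same long format can probably be reached with data.table and base R alone. The following is a sketch under the assumption that the categories are separated by a plain comma; longDT is a hypothetical name, and newDT is rebuilt here only so the example is self-contained:

```r
library(data.table)

# Same content as the newDT shown above
newDT <- data.table(Keyword = c("This is a test1", "This is a test2"),
                    category = c("cat1,cat3", "cat2,cat3"))

# Split each comma-separated string and return one row per category
longDT <- newDT[, .(category = unlist(strsplit(category, ",", fixed = TRUE))),
                by = Keyword]
print(longDT)
```

Grouping by Keyword lets data.table repeat each keyword once for every category returned by strsplit.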
