
Cleaning comma-separated words appearing in a column in R

I have a data frame that looks like the following:

df = 
Number    Words
 1        A@pple11, Mango   , !!!,Banana,...
 2        G###,Clutter image, Focus^& yourself,..
 3        ....

This is a small example that mimics the actual data frame, which is huge. I need to clean it up and produce something like this:

 df = 
 Number    Words
 1        Apple11,Mango,Banana,...
 2        G,Clutter image, Focus yourself,..
 3        ....

I am using the following approach.

   dt_2 <- df[, .(Tokens = unlist(strsplit(Words, split = ' '))), by = Number]

   dt_2$Tokens = gsub('([[:punct:]])|\\s+', '_', dt_2$Tokens)

   dt_2[, Words := tm::scan_tokenizer(Tokens) %>%
     tm::removePunctuation()]

   dt_2[, Stems := tm::stemDocument(Words)]

   dt_2[, .N, by = Words]

   CTP_clean <- dt_2[, .(Words = paste(Words, collapse = ' ')), by = Number]

There are a couple of problems with this approach. First, I am getting a warning:

   In `[.data.table`(dt_2, , `:=`(Words, tm::scan_tokenizer(Tokens) %>%  :
   Supplied 95577 items to be assigned to 95887 items of column 'Words'     
   (recycled leaving remainder of 310 items).

Second, words separated by a space (such as "Clutter image") are no longer treated as a single entity. Any help with the warning and the clean-up would be great.

Maybe the following would work for you (here `test` is a data.table holding the example data):

library(splitstackshape)
cSplit(test, "Words", ",", "long")[
  , Words := gsub("[[:punct:]]", "", Words)][
    Words != "", list(Words = toString(Words)), Number]
#    Number                            Words
# 1:      1           Apple11, Mango, Banana
# 2:      2 G, Clutter image, Focus yourself

If you don't want a space after each comma, use:

paste(Words, collapse = ",")

instead of:

toString(Words)
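
To make the difference concrete, here is a minimal base R sketch; the `words` vector is made-up illustration data, not taken from the real data set.

```r
# Made-up tokens, used only to compare the two ways of joining them
words <- c("Apple11", "Mango", "Banana")

# toString() joins with a comma followed by a space
with_space <- toString(words)

# paste() with collapse = "," joins with a bare comma
without_space <- paste(words, collapse = ",")

print(with_space)
print(without_space)
```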

You can, of course, not use "splitstackshape" -- I won't be offended. In that case, you can do something like:

test[, list(Words = unlist(strsplit(Words, ",", fixed = TRUE))), Number][
  , Words := gsub("[[:punct:]]|^\\s+|\\s+$", "", Words)][
    Words != "", list(Words = toString(Words)), Number]
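
As for the warning in the question, a likely cause is that tm::scan_tokenizer() returns one flat vector of whitespace-separated tokens and drops empty strings, so its result can be shorter than the column it is assigned to, and := then recycles it. The sketch below reproduces that length mismatch in base R. It uses scan() as a stand-in for the tokenizer (an assumption about how the tokenizer behaves), and the input string is made up for illustration.

```r
# Splitting on a single space leaves empty strings where spaces repeat
tokens <- strsplit("Mango   , Banana", " ", fixed = TRUE)[[1]]

# Same substitution as in the question: punctuation and whitespace become "_"
tokens <- gsub("([[:punct:]])|\\s+", "_", tokens)

# Stand-in for the tokenizer (assumption): scan() reads whitespace-separated
# fields and skips the empty entries, so the result is shorter than the input
flat <- scan(text = tokens, what = "character", quote = "", quiet = TRUE)

n_tokens <- length(tokens)  # number of rows the column would have
n_flat <- length(flat)      # number of items supplied to :=

print(c(rows = n_tokens, supplied = n_flat))
```

Dropping the empty tokens before the assignment (for example with Words != "", as above) keeps both lengths in step.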

I would use a list column in a data.table and strsplit like this:

# load package
library(data.table)

# create example data
test <- data.table(
  Number = 1:3, 
  Words = c(
    "A@pple11, Mango   , !!!,Banana,",
    " G###,Clutter image, Focus^& yourself,..",
    " ...."
  )
)

# split the strings into a list column
test[, Words2 := strsplit(Words, ",")]

# look at the output
# (The elements of the list column are displayed
# comma-separated; don't be confused by that.)
test

test$Words2

test$Words2[[1]]

test$Words2[[2]][2]

Then use a combination of lapply and data.table functions to clean each element in the resulting list column.
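
One possible way to finish that step is sketched below. The helper `clean_tokens` and the column name `Clean` are hypothetical names introduced for illustration, and the cleaning rules (drop punctuation, trim whitespace, discard empty tokens) are an assumption based on the desired output in the question.

```r
library(data.table)

# Same example data as above
test <- data.table(
  Number = 1:3,
  Words = c(
    "A@pple11, Mango   , !!!,Banana,",
    " G###,Clutter image, Focus^& yourself,..",
    " ...."
  )
)

# Hypothetical helper: cleans one character vector of tokens
clean_tokens <- function(tokens) {
  tokens <- gsub("[[:punct:]]", "", tokens)  # remove punctuation
  tokens <- trimws(tokens)                   # trim leading/trailing whitespace
  tokens[nzchar(tokens)]                     # drop tokens that are now empty
}

# Split on commas, then clean each element of the resulting list column
test[, Words2 := lapply(strsplit(Words, ",", fixed = TRUE), clean_tokens)]

# Collapse each cleaned element back into one comma-separated string
test[, Clean := vapply(Words2, paste, character(1), collapse = ",")]

print(test[, .(Number, Clean)])
```

Because the split is on commas only, multi-word entries such as "Clutter image" stay together. A row that contains nothing but punctuation ends up as an empty string.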
