I have a data frame that looks like the following:
df =
Number Words
1 A@pple11, Mango , !!!,Banana,...
2 G###,Clutter image, Focus^& yourself,..
3 ....
This is a small example that mimics the actual data frame, which is huge. I need to clean it up and produce something like the following:
df =
Number Words
1 Apple11,Mango,Banana,...
2 G,Clutter image, Focus yourself,..
3 ....
I am using the following approach.
dt_2 <- df[, .(Tokens = unlist(strsplit(Words, split = '
'))), by = Number]
dt_2$Tokens = gsub('([[:punct:]])|\\s+','_',dt_2$Tokens)
dt_2[, Words := tm::scan_tokenizer(Tokens) %>%
         tm::removePunctuation()]
dt_2[, Stems := tm::stemDocument(Words)]
dt_2[, .N, by = Words]
CTP_clean <- dt_2[, .(Words = paste(Words, collapse = ' ')), by = Number]
There are a couple of problems with this approach. First, I am getting a warning:
In `[.data.table`(dt_2, , `:=`(Words, tm::scan_tokenizer(Tokens) %>% :
Supplied 95577 items to be assigned to 95887 items of column 'Words'
(recycled leaving remainder of 310 items).
Second, space-separated words are no longer treated as a single entity. Any help with the warning and with the cleanup would be great.
Maybe the following would work for you:
library(splitstackshape)
cSplit(test, "Words", ",", "long")[
, Words := gsub("[[:punct:]]", "", Words)][
Words != "", list(Words = toString(Words)), Number]
# Number Words
# 1: 1 Apple11, Mango, Banana
# 2: 2 G, Clutter image, Focus yourself
If you don't want the space between words, use:
paste(Words, collapse = ",")
instead of:
toString(Words)
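To make the difference between the two concrete, here is a minimal sketch in base R. The vector `words` is made up for illustration; `toString()` joins elements with a comma followed by a space, while `paste(..., collapse = ",")` joins them with a bare comma.

```r
# Illustrative input (hypothetical, not taken from the real data)
words <- c("Apple11", "Mango", "Banana")

# toString() is equivalent to paste(x, collapse = ", ")
with_space <- toString(words)

# paste() with collapse = "," leaves no space after the comma
without_space <- paste(words, collapse = ",")

print(with_space)
print(without_space)
```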
You can, of course, not use "splitstackshape" -- I won't be offended. In that case, you can do something like:
test[, list(Words = unlist(strsplit(Words, ",", TRUE))), Number][
, Words := gsub("[[:punct:]]|^\\s+|\\s+$", "", Words)][
Words != "", list(Words = toString(Words)), Number]
I would use a list column in a data.table and strsplit, like this:
# load package
require(data.table)
# create example data
test <- data.table(
  Number = 1:3,
  Words = c(
    "A@pple11, Mango , !!!,Banana,",
    " G###,Clutter image, Focus^& yourself,..",
    " ...."
  )
)
# split the strings into a list column
test[, Words2 := strsplit(Words, ",")]
# look at the output
# (The elements of the list column are displayed
# comma-separated; don't be confused by that.)
test
test$Words2
test$Words2[[1]]
test$Words2[[2]][2]
Then use a combination of lapply and data.table functions to clean each element of the resulting list column.
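A minimal sketch of that cleaning step might look like the following. It rebuilds the example data so that it runs on its own. The helper `clean_tokens` and the column name `Clean` are hypothetical names introduced here for illustration, and the cleaning rules (remove punctuation, trim whitespace, drop empty tokens) are an assumption about what "clean" should mean for this data.

```r
library(data.table)

# Same example data as above
test <- data.table(
  Number = 1:3,
  Words = c(
    "A@pple11, Mango , !!!,Banana,",
    " G###,Clutter image, Focus^& yourself,..",
    " ...."
  )
)

# Hypothetical helper: cleans one character vector of tokens
clean_tokens <- function(x) {
  x <- gsub("[[:punct:]]", "", x)  # remove punctuation characters
  x <- trimws(x)                   # remove leading/trailing whitespace
  x[nzchar(x)]                     # drop tokens that are now empty
}

# Split on commas, then clean each element of the list column
test[, Words2 := lapply(strsplit(Words, ",", fixed = TRUE), clean_tokens)]

# Collapse each cleaned token vector back into one string per row
test[, Clean := vapply(Words2, paste, character(1), collapse = ",")]

print(test$Clean)
```

Because the split happens only on commas, multi-word tokens such as "Clutter image" stay together. A row whose tokens are all punctuation ends up as an empty string in `Clean`; filter those rows out afterwards if that is not what you want.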