Replacing words with spaces within a tibble in R without anti-join
I have a tibble of sentences like so:
# A tibble: 1,782 x 1
Chat
<chr>
1 Hi i would like to find out more about the trials
2 Hello I had a guest
3 Hello my friend overseas right now
...
What I'm trying to do is remove stopwords like "I" and "hello". I already have a list of them, and I want to replace each stopword with a space. I tried using mutate and gsub, but gsub only takes a regex. An anti-join won't work here: I am building bigrams/trigrams, so I don't have a single-word column to anti-join the stopwords against.
Is there a way to replace all these words in each sentence in R?
We could unnest the tokens, replace the 'word' that is found in the 'stop_words' 'word' column with a space (" "), and paste the 'word' values back together after grouping by 'lines'.
library(tidytext)
library(tidyverse)
rowid_to_column(df1, 'lines') %>%
unnest_tokens(word, Chat) %>%
mutate(word = replace(word, word %in% stop_words$word, " ")) %>%
group_by(lines) %>%
summarise(Chat = paste(word, collapse=' ')) %>%
ungroup %>%
select(-lines)
NOTE: This replaces the stop words found in the 'stop_words' dataset with " ".
If we need only a custom subset of stop words to be replaced, then create a vector of those elements and make the change in the mutate step.
v1 <- c("I", "hello", "Hi")
rowid_to_column(df1, 'lines') %>%
...
...
mutate(word = replace(word, word %in% tolower(v1), " ")) %>%  # unnest_tokens() lowercases tokens by default, so lowercase v1 too
...
...
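For reference, here is a minimal end-to-end sketch of the custom-vector variant. It assumes the elided steps are the same as in the first pipeline above; the three-row df1 is hypothetical sample data standing in for the real tibble.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical sample data standing in for the asker's 1,782-row tibble
df1 <- tibble(Chat = c("Hi i would like to find out more about the trials",
                       "Hello I had a guest",
                       "Hello my friend overseas right now"))

# Custom stop words to blank out
v1 <- c("I", "hello", "Hi")

out <- df1 %>%
  rowid_to_column("lines") %>%                  # remember which sentence each token came from
  unnest_tokens(word, Chat) %>%                 # one token per row, lowercased by default
  mutate(word = replace(word, word %in% tolower(v1), " ")) %>%
  group_by(lines) %>%
  summarise(Chat = paste(word, collapse = " ")) %>%  # rebuild each sentence
  select(-lines)

out
```

Each removed word leaves extra spaces behind, so the rebuilt sentences may start with or contain runs of spaces. Note also that unnest_tokens() strips punctuation by default, so the rebuilt text is not identical to the input apart from the stop words.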
We can construct a pattern of the form "\\bstopword\\b" for each stop word, join the pieces with "|", and then use gsub to replace every match with " ". Here is an example. Notice that I set ignore.case = TRUE to match both lower and upper case, but you may want to adjust that for your needs.
dat <- read.table(text = "Chat
1 'Hi i would like to find out more about the trials'
2 'Hello I had a guest'
3 'Hello my friend overseas right now'",
header = TRUE, stringsAsFactors = FALSE)
dat
# Chat
# 1 Hi i would like to find out more about the trials
# 2 Hello I had a guest
# 3 Hello my friend overseas right now
# A list of stop words
stopword <- c("I", "Hello", "Hi")
# Create the pattern
stopword2 <- paste0("\\b", stopword, "\\b")
stopword3 <- paste(stopword2, collapse = "|")
# View the pattern
stopword3
# [1] "\\bI\\b|\\bHello\\b|\\bHi\\b"
dat$Chat <- gsub(pattern = stopword3, replacement = " ", x = dat$Chat, ignore.case = TRUE)
dat
# Chat
# 1 would like to find out more about the trials
# 2 had a guest
# 3 my friend overseas right now
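The same idea can be wrapped in a small helper so that it works on any character vector, including a tibble column. This is only a sketch: replace_stopwords is a hypothetical name, not a function from any package, and it assumes the stop words contain no regex metacharacters (they would need escaping first).

```r
# Hypothetical helper: replace whole-word matches of any stop word,
# ignoring case. Assumes the stop words contain no regex metacharacters.
replace_stopwords <- function(x, stopwords, replacement = " ") {
  pattern <- paste0("\\b(?:", paste(stopwords, collapse = "|"), ")\\b")
  gsub(pattern, replacement, x, ignore.case = TRUE, perl = TRUE)
}

chat <- c("Hi i would like to find out more about the trials",
          "Hello I had a guest",
          "Hello my friend overseas right now")

cleaned <- replace_stopwords(chat, c("I", "Hello", "Hi"))

# Optionally collapse the runs of spaces left behind by the replacements
squished <- trimws(gsub("\\s+", " ", cleaned))
squished
# [1] "would like to find out more about the trials"
# [2] "had a guest"
# [3] "my friend overseas right now"
```

Because the helper is vectorised, it can be used directly inside mutate on the original tibble, for example mutate(Chat = replace_stopwords(Chat, stopword)), which keeps one sentence per row for later bigram/trigram tokenisation.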