Replacing words with spaces within a tibble in R without anti-join
I have sentences like these:
# A tibble: 1,782 x 1
Chat
<chr>
1 Hi i would like to find out more about the trials
2 Hello I had a guest
3 Hello my friend overseas right now
...
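What I want to do is remove stop words such as "I" and "hello". I already have a list of them and want to replace these stop words with a space. I tried using mutate and gsub, but gsub only accepts a single regular expression. An anti-join does not work here because I am trying to build bigrams/trigrams, so I do not have a single-word column to anti-join the stop words against.
Is there a way to replace all of these words in each sentence in R?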
We can unnest the tokens, replace each 'word' that is found in the 'word' column of 'stop_words' with a space (" "), and then paste the 'word' values back together after grouping by 'lines'.
library(tidytext)
library(tidyverse)
rowid_to_column(df1, 'lines') %>%
unnest_tokens(word, Chat) %>%
mutate(word = replace(word, word %in% stop_words$word, " ")) %>%
group_by(lines) %>%
summarise(Chat = paste(word, collapse=' ')) %>%
ungroup %>%
select(-lines)
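Note: this replaces every stop word found in the 'stop_words' dataset with " ".
If we only need to replace a custom subset of stop words, create a vector of those elements and change the mutate step accordingly.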
v1 <- c("I", "hello", "Hi")
rowid_to_column(df1, 'lines') %>%
...
...
# unnest_tokens() lowercases tokens by default, so compare against tolower(v1)
mutate(word = replace(word, word %in% tolower(v1), " ")) %>%
...
...
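For reference, the custom-vector variant can be written out in full roughly as follows. This is only a sketch: `df1` here is a made-up three-row stand-in for the asker's tibble, and because `unnest_tokens()` lowercases tokens by default, the vector is lowercased before matching.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Assumption: a small stand-in for the asker's 1,782-row tibble
df1 <- tibble(Chat = c("Hi i would like to find out more about the trials",
                       "Hello I had a guest",
                       "Hello my friend overseas right now"))

# Custom stop words; lowercased because unnest_tokens() lowercases tokens
v1 <- tolower(c("I", "hello", "Hi"))

res <- rowid_to_column(df1, "lines") %>%
  unnest_tokens(word, Chat) %>%                      # one row per word
  mutate(word = replace(word, word %in% v1, " ")) %>% # blank out stop words
  group_by(lines) %>%
  summarise(Chat = paste(word, collapse = " ")) %>%  # rebuild each sentence
  select(-lines)

res
```

Note that the rebuilt sentences come back lowercased and without punctuation, since that is what the tokenizer produces; the regex approach may be preferable if the original casing matters.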
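We can build a pattern of the form "\\bstopword\\b" for each stop word, join the pieces with "|", and then use gsub to replace every match with " ". Here is an example. Note that I set ignore.case = TRUE so that both lowercase and uppercase forms are matched, but you may need to adjust this to suit your needs.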
dat <- read.table(text = "Chat
1 'Hi i would like to find out more about the trials'
2 'Hello I had a guest'
3 'Hello my friend overseas right now'",
header = TRUE, stringsAsFactors = FALSE)
dat
# Chat
# 1 Hi i would like to find out more about the trials
# 2 Hello I had a guest
# 3 Hello my friend overseas right now
# A list of stop words
stopword <- c("I", "Hello", "Hi")
# Create the pattern
stopword2 <- paste0("\\b", stopword, "\\b")
stopword3 <- paste(stopword2, collapse = "|")
# View the pattern
stopword3
# [1] "\\bI\\b|\\bHello\\b|\\bHi\\b"
dat$Chat <- gsub(pattern = stopword3, replacement = " ", x = dat$Chat, ignore.case = TRUE)
dat
# Chat
# 1 would like to find out more about the trials
# 2 had a guest
# 3 my friend overseas right now
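The pattern-building steps above could be wrapped in a small helper so they can be reused on any character column. This is a sketch in base R; `replace_stopwords` is a hypothetical name, not part of any package, and it assumes the stop words contain no regex metacharacters. The final whitespace cleanup is optional and should be skipped if the gaps left by the removed words matter.

```r
# Hypothetical helper: replace whole-word stop words with a single space
replace_stopwords <- function(x, stopwords, ignore_case = TRUE) {
  # \\b anchors each alternative at word boundaries, so "Hi" does not match "this"
  pattern <- paste0("\\b(?:", paste(stopwords, collapse = "|"), ")\\b")
  gsub(pattern, " ", x, ignore.case = ignore_case, perl = TRUE)
}

chat <- c("Hi i would like to find out more about the trials",
          "Hello I had a guest",
          "Hello my friend overseas right now")

replaced <- replace_stopwords(chat, c("I", "Hello", "Hi"))

# Optional: collapse runs of whitespace and trim the ends
tidy <- trimws(gsub("\\s+", " ", replaced))
tidy
# [1] "would like to find out more about the trials"
# [2] "had a guest"
# [3] "my friend overseas right now"
```

Inside a tibble the same helper can be applied with something like `mutate(Chat = replace_stopwords(Chat, stopword))`.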