I can't remove • and some other special characters such as '- using tm_map
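I searched the existing questions and was able to replace • in the first set of commands below. However, when I apply the same function to my corpus it doesn't work, and the • still appears. The corpus has 6570 elements and is 2.3 MB in size, so it appears to be valid.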
> x <- "• R Tutorial"
> gsub("•","",x)
[1] " R Tutorial"
> removeSpecialChars <- function(x) gsub("•","",x)
> corpus2=tm_map(corpus2, removeSpecialChars)
> print(corpus2[[6299]][1])
[1] "• R tutorial • success– october"
> ##remove special characters
How about an alternative that works with corpus objects in a more direct way?
require(quanteda)
require(magrittr)
corpus3 <- corpus(c("• R Tutorial", "More of these • characters •", "Tricky •!"))
# remove the character from the tokenized corpus
tokens(corpus3)
## tokens from 3 documents.
## text1 :
## [1] "•" "R" "Tutorial"
##
## text2 :
## [1] "More" "of" "these" "•" "characters" "•"
##
## text3 :
## [1] "Tricky" "•" "!"
tokens(corpus3) %>% tokens_remove("•")
## tokens from 3 documents.
## text1 :
## [1] "R" "Tutorial"
##
## text2 :
## [1] "More" "of" "these" "characters"
##
## text3 :
## [1] "Tricky" "!"
# remove the character from the corpus itself
texts(corpus3) <- gsub("•", "", texts(corpus3), fixed = TRUE)
texts(corpus3)
## text1 text2 text3
## " R Tutorial" "More of these characters " "Tricky !"
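If you would rather stay with tm, here is a minimal sketch rather than a confirmed diagnosis of the problem in the question. It assumes tm 0.6 or later, where custom functions passed to tm_map are usually wrapped in content_transformer() so that each element stays a text document. The helper names remove_special_chars and squish and the sample strings are made up for illustration. The character class covers the bullet, the en dash, the apostrophe and the hyphen in one pass. If the bullet still survives in your real data, an encoding mismatch between the pattern and the corpus text is another possibility worth checking.

```r
# Hypothetical helper (not from the original post): remove a set of special
# characters in one pass using a regex character class.
# \u2022 is the bullet, \u2013 the en dash; the trailing "-" is a literal hyphen.
remove_special_chars <- function(x, pattern = "[\u2022\u2013'-]") {
  gsub(pattern, "", x)
}

# Hypothetical helper: collapse the runs of whitespace left behind, then trim.
squish <- function(x) trimws(gsub("\\s+", " ", x))

# Illustrative sample documents, written with Unicode escapes so the code
# does not depend on the encoding of the script file.
docs <- c("\u2022 R tutorial \u2022 success\u2013 october", "Tricky \u2022!")

cleaned <- squish(remove_special_chars(docs))
print(cleaned)

# Assumption: tm 0.6 or later is installed. Wrapping the helper in
# content_transformer() applies it to the text of each document.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(docs))
  corpus <- tm::tm_map(corpus, tm::content_transformer(remove_special_chars))
  print(as.character(corpus[[1]]))
}
```

The base R part runs on its own, so you can test the pattern on a few strings before applying it to the whole corpus.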