
Remove special characters from a corpus

I built a data frame that shows all the terms with punctuation and their frequencies. I am then supposed to remove the punctuation from them and check whether any punctuation remains.

newpapers1 <- tm_map(newpapers, removePunctuation)

punremove <- function(x){gsub(c('¡'|'¯'),"",x)}
punremove1 <- lapply(newpapers1, punremove)
my.check.func <- function(x){str_extract_all(x, "[[:punct:]]")}
my.check1 <- lapply(newpapers1, my.check.func)
p <- as.data.frame(table(unlist(my.check1)))
p

But I still end up with this special character:

  Var1 Freq
1    ¡   25

Is there a way to write a function that removes all the punctuation at once, or a function that removes just this character?

Edit: Upon checking the documents, the punctuation still exists:

> newpapers1[[24]]$content

"This study employs a crosscultural perspective to examine how local audiences perceive and enjoy foreign dramas and how this psychological process differs depending on the cultural distance between the media and the viewing audience Using a convenience sample of young Korean college students this study as predicted by cultural discount theory shows that cultural distance decreases Korean audiences¡¯ perceived identification with dramatic characters which erodes their enjoyment of foreign dramas Unlike cultural discount theory however cultural distance arouses Korean audiences¡¯ perception of novelty which heightens their enjoyment of foreign dramas This study discusses the theoretical and practical implications of these findings as well as their potential limitations"


You can use gsub to remove the punctuation, like this.

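newpapers1 <- tm_map(newpapers, removePunctuation)

# Remove any punctuation that removePunctuation left behind
punremove <- function(x){gsub('[[:punct:]]+','',x)}
newpapers2 <- lapply(newpapers1, punremove)

# Tabulate whatever punctuation remains (str_extract_all comes from stringr)
my.check.func <- function(x){str_extract_all(x, "[[:punct:]]")}
my.check1 <- lapply(newpapers2, my.check.func)
p <- as.data.frame(table(unlist(my.check1)))
p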
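
One caveat, offered as a sketch rather than a definitive fix: base R's `[[:punct:]]` class is locale-dependent, so it may not match non-ASCII characters such as `¡` and `¯`. The `¡¯` pair in the abstract looks like a curly apostrophe that was decoded with the wrong encoding. Assuming the text is UTF-8, the Unicode property classes `\p{P}` (punctuation) and `\p{S}` (symbols) with `perl = TRUE` should cover both characters. The helper name `remove_punct_unicode` and the sample string below are illustrative and do not come from tm or from the question.

```r
# Hypothetical helper (not part of tm): drop every Unicode punctuation
# and symbol character from a character vector.
remove_punct_unicode <- function(x) {
  x <- enc2utf8(x)  # make sure the regex engine treats the input as UTF-8
  gsub("[\\p{P}\\p{S}]", "", x, perl = TRUE)
}

# Illustrative sample containing the two leftover characters from the question
sample_text <- "Korean audiences\u00a1\u00af perceived identification"
cleaned <- remove_punct_unicode(sample_text)
print(cleaned)

# Check for leftovers, mirroring the frequency table in the question
leftover <- unlist(regmatches(cleaned,
                              gregexpr("[\\p{P}\\p{S}]", cleaned, perl = TRUE)))
print(table(leftover))
```

To apply the helper to the corpus, one option is `tm_map(newpapers, content_transformer(remove_punct_unicode))`, which keeps the result a corpus instead of a plain list. Note that `\p{S}` also strips symbols such as `$`, `+` and `=`, so drop it from the pattern if those should survive.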

Hope this helps.
