
Counting the number of stop words in a text

I was wondering if anyone could help me with the following problem: I am trying to determine the number (count) of stop words in customer review texts. I am using the stop word list from the "quanteda" package in R. I have tokenised the text and selected the stop words with the following code:

stop.words <- tokens_select(corpus2.tokens, stopwords())

However, I am now having trouble saving these results in such a way that I can count the actual number of stop words in each review.

Any tips would be greatly appreciated. Thanks in advance!

You can use str_detect from stringr (or stri_detect from stringi) to count the stop words. str_detect returns TRUE or FALSE for each entry in the stop word list, and you can simply sum those values. Note that str_detect matches substrings, so this counts how many distinct entries of the stop word list occur anywhere in the text (including inside longer words), not how many stop word tokens the text contains. The result also depends on which stop word list you use: with stopwords("en") from the stopwords package the example below returns 28; with stopwords(source = "smart") you get a count of 61.

text <- "I've never had a better pulled pork pizza! The amount of toppings that they layered on it was astounding...bacon, corn, more pulled pork, and the sauce was delicious. I shared my pizza with 2 other people. I can't wait to go back."
stopwords <- stopwords::stopwords("en")

# str_detect() returns one TRUE/FALSE per stop word pattern; sum() counts the TRUEs
sum(stringr::str_detect(tolower(text), stopwords))
# [1] 28
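
Since the question asks for a count per review, a token-based approach with quanteda may be closer to what is needed. The following is only a minimal sketch, not the method from the answer above: it assumes quanteda is installed, and the `reviews` vector and all variable names are made up for illustration. It keeps only the stop word tokens with tokens_select() and then counts the tokens left in each document with ntoken().

```r
library(quanteda)

# Hypothetical example reviews (not from the original question)
reviews <- c(r1 = "The pizza was great", r2 = "Delicious")

# Tokenise each review, dropping punctuation
review_tokens <- tokens(reviews, remove_punct = TRUE)

# Keep only the tokens that appear in the English stop word list
# (matching is case-insensitive by default)
stop_tokens <- tokens_select(review_tokens, stopwords("en"), selection = "keep")

# Number of stop word tokens per review
stopword_counts <- ntoken(stop_tokens)
stopword_counts
```

Unlike the str_detect approach, this counts every occurrence of a stop word as a whole token, so a stop word that appears three times in a review contributes three to that review's count.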

