quanteda: remove tags (#, @) and URLs in one string
Consider the following string:
txt <- ("Viele Dank für das Feedback + die Verbesserungsvorschläge! :) http://testurl.com/5lhk5p #Greenwashing #PR #Vattenfal")
I create a dfm (document-feature matrix) and pre-process the string as follows:
txt_corp <- quanteda::corpus(txt)
txt_dfm <- quanteda::dfm(txt_corp,remove_punct=TRUE, remove_symbols=TRUE, remove_url = T)
topfeatures(txt_dfm)
The output then looks as follows:
topfeatures(txt_dfm)
viele dank für das feedback
1 1 1 1 1
die verbesserungsvorschläge #greenwashing #pr #vattenfal
1 1 1 1 1
This is not bad, but I would like the output without the hashtag symbol (#). I've tried some combinations like:
txt_dfm <- quanteda::dfm(txt_corp,remove_punct=TRUE, remove_symbols=TRUE, remove_url = T, what ="word1")
topfeatures(txt_dfm)
viele dank für das feedback
1 1 1 1 1
die verbesserungsvorschläge http testurl.com 5lhk5p
1 1 1 1 1
Then I receive the output above. On the one hand the hashtags are removed, but on the other hand the link is split into pieces instead of being removed. Can somebody help me obtain the following output using quanteda?
viele dank für das feedback
1 1 1 1 1
die verbesserungsvorschläge greenwashing pr vattenfal
1 1 1 1 1
There is a regex pattern in quanteda_options() that matches hashtags. If you set it to NULL, the tokenizer stops preserving them.
require(quanteda)
quanteda_options(reset = TRUE)
quanteda_options("pattern_hashtag")
# [1] "#\\w+#?"
tokens("#aaaa bbbb")
# Tokens consisting of 1 document.
# text1 :
# [1] "#aaaa" "bbbb"
quanteda_options("pattern_hashtag" = NULL)
tokens("#aaaa bbbb")
# Tokens consisting of 1 document.
# text1 :
# [1] "#" "aaaa" "bbbb"
Why not remove the hashtag symbol from your string before building the dfm?
txt <- gsub("#","",txt)
> txt_dfm
Document-feature matrix of: 1 document, 10 features (0.0% sparse).
features
docs viele dank für das feedback die verbesserungsvorschläge greenwashing pr vattenfal
text1 1 1 1 1 1 1 1 1 1 1
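A plain gsub("#", "", txt) removes every "#" in the text, including one that might sit inside a URL fragment. A slightly more conservative variant, sketched below in base R only, strips the symbol just where it starts a tag (at the start of the string or after whitespace) and handles "@" the same way. The variable name clean and the regular expression are illustrative assumptions, not part of quanteda.

```r
# Sample text from the question
txt <- "Viele Dank für das Feedback + die Verbesserungsvorschläge! :) http://testurl.com/5lhk5p #Greenwashing #PR #Vattenfal"

# Drop a leading "#" or "@" only when it opens a tag, i.e. when it sits at
# the start of the string or right after whitespace and is followed by a
# word character. The tag word itself is kept.
clean <- gsub("(^|\\s)[#@](\\w)", "\\1\\2", txt, perl = TRUE)

cat(clean, "\n")
# Viele Dank für das Feedback + die Verbesserungsvorschläge! :) http://testurl.com/5lhk5p Greenwashing PR Vattenfal
```

The cleaned string can then go through the same corpus() and dfm() calls as in the question, where remove_url still sees the link intact.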