
quanteda: remove tags (#, @) and URLs in one string

Consider the following string:

txt <- ("Viele Dank für das Feedback + die Verbesserungsvorschläge! :) http://testurl.com/5lhk5p #Greenwashing #PR #Vattenfal")

I create a dfm (document-feature matrix) and pre-process the string as follows:

txt_corp <- quanteda::corpus(txt)
txt_dfm <- quanteda::dfm(txt_corp, remove_punct = TRUE, remove_symbols = TRUE, remove_url = TRUE)
topfeatures(txt_dfm)

The output then looks as follows:

topfeatures(txt_dfm)
              viele                    dank                     für                     das                feedback 
                  1                       1                       1                       1                       1 
                die verbesserungsvorschläge           #greenwashing                     #pr              #vattenfal 
                  1                       1                       1                       1                       1 

This is not bad, but I would like the output without the hashtag symbol (#). I've tried some combinations, such as:

txt_dfm <- quanteda::dfm(txt_corp, remove_punct = TRUE, remove_symbols = TRUE, remove_url = TRUE, what = "word1")

topfeatures(txt_dfm)
              viele                    dank                     für                     das                feedback 
                  1                       1                       1                       1                       1 
                die verbesserungsvorschläge                    http             testurl.com                  5lhk5p 
                  1                       1                       1                       1                       1 

Then I receive the output above. On the one hand the hashtag symbols are removed, but on the other hand the link is split into pieces instead of being removed. Can somebody help me obtain the following output using quanteda?

              viele                    dank                     für                     das                feedback 
                  1                       1                       1                       1                       1 
                die verbesserungsvorschläge            greenwashing                      pr               vattenfal 
                  1                       1                       1                       1                       1 

quanteda_options() holds a regex pattern, pattern_hashtag, that matches hashtags. If you set it to NULL, the tokenizer stops preserving hashtags as single tokens and splits the # off from the word.

require(quanteda)
quanteda_options(reset = TRUE)
quanteda_options("pattern_hashtag")     
# [1] "#\\w+#?"
tokens("#aaaa bbbb")
# Tokens consisting of 1 document.
# text1 :
# [1] "#aaaa" "bbbb" 

quanteda_options("pattern_hashtag" = NULL)
tokens("#aaaa bbbb")
# Tokens consisting of 1 document.
# text1 :
# [1] "#"    "aaaa" "bbbb"
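Applied to the string from the question, the same idea might look like the sketch below. It does not change the global option. Instead it passes split_tags = TRUE to tokens(), which is an assumption here: the argument should be available in quanteda 3.0 or later, and it controls whether the tag patterns from quanteda_options() are kept together. Once "#" is its own token, remove_punct should drop it, while remove_url should still see the URL as one token. The exact behaviour may differ between quanteda versions, so treat this as a starting point rather than a verified result.

```r
library(quanteda)

txt <- "Viele Dank für das Feedback + die Verbesserungsvorschläge! :) http://testurl.com/5lhk5p #Greenwashing #PR #Vattenfal"

# Tokenise first so the cleaning options act on individual tokens.
# split_tags = TRUE (assumed: quanteda >= 3.0) splits "#tag" into "#" and "tag";
# remove_punct then drops the bare "#", and remove_url drops the URL token.
toks <- tokens(
  txt,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE,
  split_tags = TRUE
)

# dfm() lowercases features by default.
txt_dfm <- dfm(toks)
featnames(txt_dfm)
```

Because the argument is set per call, other code in the same session that relies on hashtags being preserved is not affected.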

Why not remove the hashtag symbol from your string before creating the dfm?

txt <- gsub("#","",txt)

> txt_dfm
Document-feature matrix of: 1 document, 10 features (0.0% sparse).
       features
docs    viele dank für das feedback die verbesserungsvorschläge greenwashing pr vattenfal
  text1     1    1   1   1        1   1                       1            1  1         1
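Since the question also mentions @ tags, the gsub() call above could be narrowed so that it only strips a "#" or "@" at the start of a whitespace-delimited word and leaves a "#" inside a URL alone. The helper name strip_tag_markers is hypothetical and not part of quanteda; this is a minimal base R sketch of that preprocessing step.

```r
# Hypothetical helper (not part of quanteda): drop a leading "#" or "@"
# only when it begins a whitespace-delimited word.
strip_tag_markers <- function(x) {
  # (^|\\s) keeps the preceding whitespace; (?=\\w) requires a word character after the marker.
  gsub("(^|\\s)[#@](?=\\w)", "\\1", x, perl = TRUE)
}

strip_tag_markers("thanks @user see #PR http://a.com/#top")
# [1] "thanks user see PR http://a.com/#top"
```

The cleaned string can then be passed to corpus() and dfm() as in the question, with remove_url = TRUE still taking care of the link.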
