
R: quanteda removing tags from corpus

I am working with a number of texts using the quanteda package. My texts contain tags, some with unique values such as URLs. I want to remove not only the tags but everything inside the tags as well.

Example:

<oa>
</oa>
<URL: http://in.answers.yahoo.com/question/index;_ylt=Ap2wvXm2aeRQKHO.HeDgTfneQHRG;_ylv=3?qid=1006042400700>
<q>
<ad>
</ad>

I'm not sure how to remove them while working with the quanteda package. It seems to me that the dfm function would be the place to do it; I don't think stopwords will work because of the unique URLs. I can use the following gsub call with a regular expression to successfully target the tags I want to remove:

x <- gsub("<.*?>", "", y)
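
For illustration, here is a minimal base-R sketch of what that pattern does. The sample string y is made up from the tags shown above (with a shortened placeholder URL) and is not taken from the real corpus. The lazy quantifier .*? stops at the first >, so each tag is removed separately, whereas a greedy .* would also swallow the text between the first < and the last >.

```r
# Hypothetical sample text built from the example tags (URL is a placeholder)
y <- "<oa>Some answer text</oa> <URL: http://example.com/q?id=1> more text <q>"

# Lazy match: each tag is removed on its own, the text between tags survives
lazy <- gsub("<.*?>", "", y)

# Greedy match: everything from the first "<" to the last ">" is removed
greedy <- gsub("<.*>", "", y)

print(lazy)
print(greedy)
```

Note that the lazy version leaves behind the spaces that surrounded the tags, so some whitespace cleanup may still be wanted afterwards.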

I've gone through the dfm documentation and have tried a few things with the remove and valuetype arguments, but perhaps I don't understand the documentation very well.

Also, as shown by the answer to this question, I tried the dfm_select function, but no dice there either.

Here is my code:

library(readtext)
library(quanteda)

#list all .txt files below the working directory
data_dir <- list.files(pattern="*.txt", recursive = TRUE, full.names = TRUE)

#create corpus    
micusp_corpus <- corpus(readtext(data_dir))

#add field 'region'
docvars(micusp_corpus, "Region") <- gsub("(\\w{6})\\..*?$", "", rownames(micusp_corpus$documents))

#create document feature matrix
micusp_dfm <- dfm(micusp_corpus, groups = "Region", remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
#try to remove tags
micusp_dfm <- dfm_select(micusp_dfm, "<.*?>", selection = "remove", valuetype = "regex")

#show top tokens (note the appearance of the tag content "oa")
textstat_frequency(micusp_dfm, n=10)

While your question does not provide a reproducible example, I think I can help. You want to clean the texts that go into your corpus before you reach the dfm construction stage. Replace the #create corpus line with this:

# read texts, remove tags, and create the corpus
tmp <- readtext(data_dir)
tmp$text <- gsub("<.*?>", "", tmp$text)
micusp_corpus <- corpus(tmp)
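
As a rough sketch of the same idea without any files on disk, the snippet below uses a hand-made data frame as a stand-in for what readtext() returns (a doc_id column and a text column). The file names and texts are invented for illustration. It also collapses the whitespace left behind by the removed tags, which is an optional extra step and not part of the replacement code above.

```r
# Stand-in for readtext(data_dir): invented doc ids and texts
tmp <- data.frame(
  doc_id = c("REGION1.txt", "REGION2.txt"),
  text = c("<oa>First text</oa> <URL: http://example.com/a> here",
           "<q>Second <ad>hidden ad</ad> text"),
  stringsAsFactors = FALSE
)

# Remove every tag, including whatever sits between the angle brackets
# (text between an opening and a closing tag, e.g. "hidden ad", is kept)
tmp$text <- gsub("<.*?>", "", tmp$text)

# Optional: squeeze repeated whitespace and trim the ends
tmp$text <- trimws(gsub("\\s+", " ", tmp$text))

print(tmp$text)
```

The cleaned data frame should then be usable in place of tmp in the corpus() call above, assuming it keeps the same doc_id and text columns.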
