R: Quanteda, removing tags from a corpus

Question

I am working with a number of texts using the quanteda package. My texts contain tags, some with unique values such as URLs. I want to remove not only the tags but also everything inside them.

Example:

<oa>
</oa>
<URL: http://in.answers.yahoo.com/question/index;_ylt=Ap2wvXm2aeRQKHO.HeDgTfneQHRG;_ylv=3?qid=1006042400700>
<q>
<ad>
</ad>

I'm not sure how to remove them while working with the quanteda package. It seems to me that the dfm function would be the place to do it; I don't think stopwords will work because of the unique URLs. I can use the following gsub call with a regular expression to successfully target the tags I want to remove:

x <- gsub("<.*?>", "", y)
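For illustration, here is a minimal base R sketch of what that pattern does. The sample strings are made up, modelled loosely on the tags in the example above. Note that the pattern removes each tag and whatever sits between its angle brackets (such as the URL), but keeps text that sits between an opening and a closing tag.

```r
# Made-up sample text containing the kinds of tags shown in the example
y <- c(
  "<oa>Some opening text</oa>",
  "See <URL: http://example.com/index;_x=1?qid=42> for details.",
  "<q>Quoted <ad>advert</ad> text"
)

# "<.*?>" matches "<", then as few characters as possible, up to the first ">"
x <- gsub("<.*?>", "", y)
print(x)
```

The second string ends up with two consecutive spaces where the URL tag used to be, so a whitespace clean-up step may be worth adding afterwards.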

I've gone through the dfm documentation and have tried a few things with the remove and valuetype arguments, but perhaps I don't understand the documentation very well.

Also, as shown by the answer to this question, I tried the dfm_select function, but no dice there either.

Here is my code:

library(readtext)
library(quanteda)

#list the .txt files to read
data_dir <- list.files(pattern="*.txt", recursive = TRUE, full.names = TRUE)

#create corpus    
micusp_corpus <- corpus(readtext(data_dir))

#add field 'region'
docvars(micusp_corpus, "Region") <- gsub("(\\w{6})\\..*?$", "", rownames(micusp_corpus$documents))

#create document feature matrix
micusp_dfm <- dfm(micusp_corpus, groups = "Region", remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
#try to remove tags
micusp_dfm <- dfm_select(micusp_dfm, "<.*?>", selection = "remove", valuetype = "regex")

#show top tokens (note the appearance of the tag content "oa")
textstat_frequency(micusp_dfm, n=10)
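One possible explanation for why the dfm_select call has no effect, sketched in base R: a regex pattern passed to dfm_select is compared against individual feature names, not against the raw text. The sketch assumes (this has not been checked against the actual data) that tokenization splits the angle brackets from the tag names and that remove_punct and remove_symbols then drop them, leaving features such as "oa". The feature names below are hypothetical.

```r
# Hypothetical feature names, roughly what might remain once "<" and ">"
# have been split off and dropped during tokenization (an assumption)
feats <- c("oa", "q", "ad", "text", "url")

pattern <- "<.*?>"

# Tested one feature at a time, the pattern finds no angle brackets to match
matches <- grepl(pattern, feats)
print(matches)

# Against the raw, untokenized text the same pattern does match
raw_match <- grepl(pattern, "<oa> some text </oa>")
print(raw_match)
```

If that assumption holds, the tags have to be removed from the text before it is tokenized.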

Answer 1

While your question does not provide a reproducible example, I think I can help. You want to clean the texts that go into your corpus before you reach the dfm construction stage. Replace the #create corpus line with this:

# read texts, remove tags, and create the corpus
tmp <- readtext(data_dir)
tmp$text <- gsub("<.*?>", "", tmp$text)
micusp_corpus <- corpus(tmp)
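
To try the cleaning step without the original files, here is a small self-contained sketch in base R. It uses a hand-built data frame as a stand-in for the object returned by readtext(), assuming only that the document text lives in a `text` column, as in the code above. The document names and texts are made up, and the whitespace clean-up is an optional extra that is not part of the original answer.

```r
# Stand-in for the data frame returned by readtext(): one row per document,
# with the text in a `text` column (all values here are made up)
tmp <- data.frame(
  doc_id = c("doc1.txt", "doc2.txt"),
  text = c("<oa> First   essay </oa>",
           "<q>Second essay <URL: http://example.com/a?b=1>"),
  stringsAsFactors = FALSE
)

# Remove the tags, including everything between "<" and ">"
tmp$text <- gsub("<.*?>", "", tmp$text)

# Optional extra: collapse the whitespace left behind and trim the ends
tmp$text <- trimws(gsub("\\s+", " ", tmp$text))

print(tmp$text)
```

The cleaned data frame can then be passed to corpus() exactly as in the answer's code.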

Solution 1: 2, accepted, 2019-04-01 03:38:03
