R: quanteda removing tags from corpus
I am working with a number of texts using the quanteda package. My texts contain tags, some with unique values like URLs. I want to remove not only the tags but everything inside the tags as well.
Example:
<oa>
</oa>
<URL: http://in.answers.yahoo.com/question/index;_ylt=Ap2wvXm2aeRQKHO.HeDgTfneQHRG;_ylv=3?qid=1006042400700>
<q>
<ad>
</ad>
I'm not sure how to remove them while working with the quanteda package. It seems to me that the dfm function would be the place to do it; I don't think stopwords will work because of the unique URLs. I can use the following gsub call with a regular expression to successfully target the tags I want to remove:
x <- gsub("<.*?>", "", y)
I've gone through the dfm documentation and have tried a few things with the remove and valuetype arguments, but perhaps I don't understand the documentation very well.
Also, as shown by the answer in this question, I tried the dfm_select function, but no dice there either.
Here is my code:
library(readtext)
library(quanteda)
#create directory
data_dir <- list.files(pattern="*.txt", recursive = TRUE, full.names = TRUE)
#create corpus
micusp_corpus <- corpus(readtext(data_dir))
#add field 'region'
docvars(micusp_corpus, "Region") <- gsub("(\\w{6})\\..*?$", "", rownames(micusp_corpus$documents))
#create document feature matrix
micusp_dfm <- dfm(micusp_corpus, groups = "Region", remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
#try to remove tags
micusp_dfm <- dfm_select(micusp_dfm, "<.*?>", selection = "remove", valuetype = "regex")
#show top tokens (note the appearance of the tag content "oa")
textstat_frequency(micusp_dfm, n=10)
While your question does not provide a reproducible example, I think I can help. You want to clean the texts that go into your corpus before you reach the dfm construction stage. Replace the #create corpus line with this:
# read texts, remove tags, and create the corpus
tmp <- readtext(data_dir)
tmp$text <- gsub("<.*?>", "", tmp$text)
micusp_corpus <- corpus(tmp)
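To see what that gsub call does without needing your files, here is a minimal sketch in base R. The sample strings are made up for illustration (modelled on the tags in the question, with a placeholder example.com URL), and the whitespace tidy-up step is an optional extra of mine, not something from your code. It may also explain why dfm_select found nothing: by the time the dfm exists, a tag such as <oa> has most likely been split into separate tokens and its angle brackets dropped by remove_punct and remove_symbols, so no single feature matches "<.*?>" any more.

```r
# Hypothetical sample texts, standing in for tmp$text
y <- c(
  "<oa> Some essay text </oa>",
  "See <URL: http://example.com/page?id=1> for details.",
  "<q> <ad> No closing tag here"
)

# Non-greedy match: each "<...>" span is removed on its own
x <- gsub("<.*?>", "", y)

# Optional tidy-up (my assumption): collapse the whitespace left behind
x <- trimws(gsub("\\s+", " ", x))

# For contrast, a greedy pattern runs from the first "<" to the last ">"
# and therefore also deletes the text between two tags
greedy <- gsub("<.*>", "", y[1])

print(x)
print(greedy)
```

The non-greedy quantifier matters here: with "<.*>" the first sample text would be removed entirely, because the match extends from the opening <oa> to the closing </oa>.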