
Construct a character vector of multi-word phrases using regex for building a dfm using quanteda in R

I have been using quanteda's textstat_collocation() to extract multi-word expressions (MWEs), with great satisfaction. Now I'm trying to extract all matches of a specific pattern, irrespective of their frequency.

My objective is to create a character vector by extracting the featnames from a dfm() built with a regex pattern. I will then use this character vector in the "select" argument when building a dfm. I might also add this character vector to a dictionary that I use as an ontology for building dfms at later stages of the pipeline.

The pattern is: "aged xx-xx", where x is a digit.

I tested the regex pattern "aged\s([0-9]{2}-[0-9]{2})" outside R and got the desired matches. But when I try it in R (adding an additional "\" before "\s"), I don't get any matches.

When I run:

txt <- c("In India, male smokers aged 20-45 perceive brandX positively.",
              "In Spain, female buyers aged 30-39 don't purchase brandY.")
ageGroups <- dfm(txt, select = "aged\\s([0-9]{2}-[0-9]{2})", valuetype = "regex")
featnames(ageGroups)

I get:

character(0)

However, when I try:

ageGroups <- dfm(txt, select = "([0-9]{2}-[0-9]{2})", valuetype = "regex")
featnames(ageGroups)

I get:

[1] "20-45" "30-39"

It seems I'm unable to capture the white space in the regex. I've gone through many similar questions on SO, including the one that seemed most relevant, but I still can't get this to work for my specific objective.

I also tried:

tokens <- tokens(txt, remove_punct = FALSE, remove_numbers = FALSE, remove_symbols = FALSE)
tokensCompounded <- tokens_compound(tokens, pattern = "aged\\s([0-9]{2}-[0-9]{2})", valuetype = "regex")
attr(tokensCompounded, "types")

But I get all tokens back:

[1] "In"         " "          "India"      ","          "male"       "smokers"    "aged"       "20-45"      "perceive"  
[10] "brandX"     "positively" "."          "Spain"      "female"     "buyers"     "30-39"      "don't"      "purchase"  
[19] "brandY" 

I think there may be other, more efficient approaches for extracting character vectors using regex (or glob) patterns with quanteda, and I'm happy to learn new ways to use this amazing R package.

Thanks for your help!

Edit to original question:

Another question on SO has a similar requirement, i.e. detecting multi-word phrases using kwic objects, and its approach can be extended to achieve the objectives stated above with the following addition:

kwicObject <- kwic(corpus, pattern = phrase("aged ([0-9]{2}-[0-9]{2})"), valuetype = "regex")
unique(kwicObject$keyword)

You can change the regex pattern:

select = "aged.*([0-9]{2}-[0-9]{2})"

The problem here is that the target text and the multi-word pattern (which contains white space) are not being tokenised in the same way. In your example, you have applied a regex that spans multiple tokens (it includes the whitespace separator), but the search target has already been split into individual tokens, so no single token can ever match it.
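The mismatch can be reproduced without quanteda. The following is a minimal base R sketch, not the package's own code: strsplit() is only a rough stand-in for the tokeniser, and the variable names whole_match, token_match and piece_match are made up for illustration.

```r
# Base R only; strsplit() is a rough stand-in for quanteda's tokeniser (assumption).
txt <- "In India, male smokers aged 20-45 perceive brandX positively."
toks <- strsplit(txt, " ", fixed = TRUE)[[1]]

# Against the untokenised string the whitespace is still present, so this matches.
whole_match <- grepl("aged\\s[0-9]{2}-[0-9]{2}", txt)

# Token by token, no single element contains "aged", a space and the digits.
token_match <- grepl("aged\\s[0-9]{2}-[0-9]{2}", toks)

# Matching each piece against individual tokens does succeed.
piece_match <- toks[grepl("^aged$|^[0-9]{2}-[0-9]{2}$", toks)]

whole_match        # TRUE
any(token_match)   # FALSE
piece_match        # "aged" "20-45"
```

This mirrors the behaviour in the question: the digit-only pattern finds "20-45" because it fits inside one token, while the pattern containing "\\s" never does.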

We devised a solution to this, a function called phrase(). From ?pattern:

Whitespace is not privileged, so that in a character vector, white space is interpreted literally. If you wish to consider whitespace-separated elements as sequences of tokens, wrap the argument in phrase().

So in this case:

toks <- tokens(txt)  # tokenise the example texts from the question first
pat <- "aged [0-9]{2}-[0-9]{2}"

toks2 <- tokens_select(toks, pattern = phrase(pat), valuetype = "regex")
toks2
# tokens from 2 documents.
# text1 :
# [1] "aged"  "20-45"
# 
# text2 :
# [1] "aged"  "30-39"

Here, we see that the selection worked, because the phrase() wrapper converted the pattern into a sequence of matches.
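To make "a sequence of matches" concrete, here is a rough base R sketch of the idea: split the pattern on whitespace and require each piece to match one of a run of consecutive tokens. The helper match_sequence() is hypothetical and is not how quanteda implements phrase(); anchoring each piece with ^ and $ is a simplification chosen for this sketch.

```r
# Hypothetical helper, not part of quanteda: match a whitespace-separated
# pattern against consecutive tokens, one regex per token.
match_sequence <- function(toks, pat) {
  parts <- strsplit(pat, " ", fixed = TRUE)[[1]]
  n <- length(parts)
  hits <- character(0)
  if (length(toks) < n) return(hits)
  for (i in seq_len(length(toks) - n + 1)) {
    window <- toks[i:(i + n - 1)]
    # Each sub-pattern must match the token in the same position of the window.
    ok <- all(mapply(function(p, t) grepl(paste0("^", p, "$"), t), parts, window))
    if (ok) hits <- c(hits, paste(window, collapse = " "))
  }
  hits
}

toks_demo <- c("male", "smokers", "aged", "20-45", "perceive")
match_sequence(toks_demo, "aged [0-9]{2}-[0-9]{2}")
# [1] "aged 20-45"
```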

If you want to turn each match into a single token, you can pass the same pattern argument to tokens_compound():

toks3 <- tokens_compound(toks2, pattern = phrase(pat), 
                         valuetype = "regex", concatenator = " ")
toks3
# tokens from 2 documents.
# text1 :
# [1] "aged 20-45"
# 
# text2 :
# [1] "aged 30-39"

Finally, you can use that to construct a dfm in which each multi-word match is a feature. This only works if you have first performed the concatenation at the tokens stage, since by definition a dfm retains no information about the order of its features.

dfm(toks3)
# Document-feature matrix of: 2 documents, 2 features (50% sparse).
# 2 x 2 sparse Matrix of class "dfm"
#        features
# docs    aged 20-45 aged 30-39
#   text1          1          0
#   text2          0          1
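To get back to the character vector the question asks for, the steps above can be chained and the feature names read off the resulting dfm. This is a sketch that assumes the quanteda package is installed and that its tokeniser keeps "20-45" as one token, as in the output shown above; the names toks_sel, toks_cmp and ageGroups are illustrative only.

```r
library(quanteda)

txt <- c("In India, male smokers aged 20-45 perceive brandX positively.",
         "In Spain, female buyers aged 30-39 don't purchase brandY.")
pat <- "aged [0-9]{2}-[0-9]{2}"

toks <- tokens(txt)
# Keep only the token sequences that match the multi-word pattern.
toks_sel <- tokens_select(toks, pattern = phrase(pat), valuetype = "regex")
# Join each matched sequence into one token, separated by a space.
toks_cmp <- tokens_compound(toks_sel, pattern = phrase(pat),
                            valuetype = "regex", concatenator = " ")

# The feature names of the dfm are the multi-word phrases as a character vector.
ageGroups <- featnames(dfm(toks_cmp))
ageGroups
```

The resulting vector could then be reused as a pattern in later selection steps or added to a dictionary, which were the two uses mentioned in the question.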
