Construct a character vector of multi-word phrases using regex for building dfm using quanteda in R

I have used quanteda's textstat_collocation() to great satisfaction for extracting multi-word expressions (MWEs). Now I'm trying to extract all matches of a specific pattern, irrespective of their frequency.

My objective is to create a character vector by extracting featnames from a dfm() built with a regex pattern. I will then use this character vector in the "select" argument for building a dfm. I might also want to use this character vector to add to a dictionary I use as an ontology for building dfms at later stages of the pipeline.

The pattern is: "aged xx-xx" where x is a digit.

I tested the regex pattern "aged\s([0-9]{2}-[0-9]{2})" here and got the desired matches. But when I try it in R (escaping the backslash, so that "\s" becomes "\\s"), I don't get any matches.

When I do:

library(quanteda)

txt <- c("In India, male smokers aged 20-45 perceive brandX positively.",
         "In Spain, female buyers aged 30-39 don't purchase brandY.")
ageGroups <- dfm(txt, select = "aged\\s([0-9]{2}-[0-9]{2})", valuetype = "regex")
featnames(ageGroups)

I get:

character(0)

However, when I try:

ageGroups <- dfm(txt, select = "([0-9]{2}-[0-9]{2})", valuetype = "regex")
featnames(ageGroups)

I get:

[1] "20-45" "30-39"

It seems I'm unable to capture the white space in the regex. I've gone through many similar questions on SO, with perhaps this being the most relevant, but I still can't get my specific objective to work.

I also tried:

# Named toks rather than tokens so the object does not shadow the tokens() function
toks <- tokens(txt, remove_punct = FALSE, remove_numbers = FALSE, remove_symbols = FALSE)
tokensCompounded <- tokens_compound(toks, pattern = "aged\\s([0-9]{2}-[0-9]{2})", valuetype = "regex")
attr(tokensCompounded, "types")

But I get all tokens back:

[1] "In"         " "          "India"      ","          "male"       "smokers"    "aged"       "20-45"      "perceive"  
[10] "brandX"     "positively" "."          "Spain"      "female"     "buyers"     "30-39"      "don't"      "purchase"  
[19] "brandY" 

I think there might be several other, more efficient approaches for extracting character vectors using regex (or glob) with quanteda, and I'm happy to learn new ways to use this amazing R package.

Thanks for your help!

Edit to original question:

This other question on SO has a similar requirement, i.e. detecting multi-word phrases using kwic objects, and its approach can be extended to achieve the objectives stated above with the following addition:

kwicObject <- kwic(corpus, pattern = phrase("aged ([0-9]{2}-[0-9]{2})"), valuetype = "regex")
unique(kwicObject$keyword)

You can change the regex pattern:

select = "aged.*([0-9]{2}-[0-9]{2})"

The problem here is that the target text and the multi-word pattern (which contains white space) are not being tokenised the same way. In your example, you have applied a regex for multiple tokens (which includes the whitespace separator) but the target for search has already been split into individual tokens.
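The mismatch can be illustrated with base R alone. This is only a rough sketch: strsplit() on whitespace is an assumed stand-in for quanteda's tokeniser, not what quanteda does internally, and the variable names are made up for the illustration.

```r
txt <- c("In India, male smokers aged 20-45 perceive brandX positively.",
         "In Spain, female buyers aged 30-39 don't purchase brandY.")
pat <- "aged\\s[0-9]{2}-[0-9]{2}"

# Against the untokenised strings, the pattern finds a match in each text
whole_match <- grepl(pat, txt)

# Crude whitespace split (assumption: a simplified substitute for tokens())
toks_list <- strsplit(txt, "\\s+")

# Against individual tokens, the pattern can never match, because
# no single token contains the whitespace that the pattern requires
token_match <- vapply(toks_list, function(tk) any(grepl(pat, tk)), logical(1))

print(whole_match)
print(token_match)
```

The pattern matches only while "aged" and the age range still sit in one string; once they are separate tokens, a regex that spans both has nothing to match against.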

We devised a solution to this, a function called phrase(). From ?pattern:

Whitespace is not privileged, so that in a character vector, white space is interpreted literally. If you wish to consider whitespace-separated elements as sequences of tokens, wrap the argument in phrase().

So in this case:

toks <- tokens(txt)  # tokenise the example texts from the question
pat <- "aged [0-9]{2}-[0-9]{2}"

toks2 <- tokens_select(toks, pattern = phrase(pat), valuetype = "regex")
toks2
# tokens from 2 documents.
# text1 :
# [1] "aged"  "20-45"
# 
# text2 :
# [1] "aged"  "30-39"

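Here, we see that the selection worked, because the phrase() wrapper split the pattern on whitespace into a sequence of per-token patterns ("aged" followed by "[0-9]{2}-[0-9]{2}"), which is then matched against consecutive tokens.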

If you want to make these a single token, you can send the same pattern argument to tokens_compound():

toks3 <- tokens_compound(toks2, pattern = phrase(pat), 
                         valuetype = "regex", concatenator = " ")
toks3
# tokens from 2 documents.
# text1 :
# [1] "aged 20-45"
# 
# text2 :
# [1] "aged 30-39"

Finally, you can use that to construct a dfm, where each multi-word match is a feature. This cannot work unless you have first performed the concatenation at the tokens stage, since by definition a dfm has no order in its features.

dfm(toks3)
# Document-feature matrix of: 2 documents, 2 features (50% sparse).
# 2 x 2 sparse Matrix of class "dfm"
#        features
# docs    aged 20-45 aged 30-39
#   text1          1          0
#   text2          0          1
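To get the character vector the question asks for, featnames() can be applied to this dfm, as in the question's own code. If only the vector of phrases is needed, with no dfm at all, the following base R sketch is an alternative. It swaps in regmatches() and gregexpr() on the raw strings in place of quanteda, so it bypasses tokenisation entirely; age_groups is a hypothetical name used only for illustration.

```r
txt <- c("In India, male smokers aged 20-45 perceive brandX positively.",
         "In Spain, female buyers aged 30-39 don't purchase brandY.")
pat <- "aged [0-9]{2}-[0-9]{2}"

# gregexpr() locates every match in each string; regmatches() extracts them
matches <- regmatches(txt, gregexpr(pat, txt))

# Flatten the per-document list and drop duplicates
age_groups <- unique(unlist(matches))
print(age_groups)
```

The resulting vector could then be wrapped in phrase() and passed as a pattern to tokens_select() or tokens_compound(), as shown above.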
