
Find multi-word strings in more than one document

To find frequent terms or phrases in a document, one can use tf (term frequency).

However, if we know that there are some specific expressions in the text but we don't know their length, or whether they include any other information, is there any way to find them? Example:

df <- data.frame(text = c("Introduction Here you see something Related work another info here", "Introduction another text Background work something to now"))

Let's say these phrases are Introduction, Related work and Background work, but we don't know exactly what the phrases are. How can we find them?

Here you need a method for detecting collocations, which fortunately quanteda has in the form of textstat_collocations(). Once you have detected these, you can compound your tokens so that each phrase becomes a single "token", and then get their frequencies in the standard way.

You do not need to know the length in advance, but you do need to specify a range of lengths. Below, I've added some more text and used a size range from 2 to 3. This also picks up the phrase "criminal background checks", without confusing it with the term "background" that also occurs in the phrase "background work". (By default, detection is case insensitive.)

library("quanteda")
## Package version: 2.1.0

text <- c(
  "Introduction Here you see something Related work another info here",
  "Introduction another text Background work something to now",
  "Background work is related to related work",
  "criminal background checks are useful",
  "The law requires criminal background checks"
)

colls <- textstat_collocations(text, size = 2:3)
colls
##                  collocation count count_nested length    lambda          z
## 1        criminal background     2            2      2  4.553877  2.5856967
## 2          background checks     2            2      2  4.007333  2.3794386
## 3               related work     2            2      2  2.871680  2.3412833
## 4            background work     2            2      2  2.322388  2.0862256
## 5 criminal background checks     2            0      3 -1.142097 -0.3426584

Here we can see that the phrases have been detected and distinguished. Now we can use tokens_compound() to join them:

toks <- tokens(text) %>%
  tokens_compound(colls, concatenator = " ")

dfm(toks) %>%
  dfm_trim(min_termfreq = 2) %>%
  dfm_remove(stopwords("en")) %>%
  textstat_frequency()
##                      feature frequency rank docfreq group
## 1               introduction         2    1       2   all
## 2                  something         2    1       2   all
## 3                    another         2    1       2   all
## 4               related work         2    1       2   all
## 5            background work         2    1       2   all
## 6 criminal background checks         2    1       2   all
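
As an aside, the object returned by textstat_collocations() is a data frame, so on a larger corpus you may want to keep only the stronger collocations before compounding. A minimal sketch of that idea, where the z > 2 cut-off is purely illustrative (in this toy example it would drop the three-word collocation) and not a recommended value:

# keep only collocation strings above an (arbitrary, illustrative) z threshold
strong <- colls$collocation[colls$z > 2]

tokens(text) %>%
  tokens_compound(phrase(strong), concatenator = " ") %>%
  dfm() %>%
  textstat_frequency()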

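And if you already know the exact expressions you are looking for, you can skip the detection step entirely: tokens_compound() also accepts fixed multi-word patterns wrapped in phrase(). A rough sketch along those lines, reusing the text vector from above (the phrase list is just an example):

known <- phrase(c("related work", "background work"))

tokens(text) %>%
  tokens_compound(known, concatenator = " ") %>%
  dfm() %>%
  textstat_frequency()
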