
Using tm() to mine PDFs for two and three word phrases

I'm trying to mine a set of PDFs for specific two- and three-word phrases. I know this question has been asked under various circumstances, and this solution partly works.

However, the list does not return strings containing more than one word.

I've tried the solutions offered in other threads (here and here, for example, as well as many others). Unfortunately, nothing works.

Also, the qdap library won't load, and I wasted an hour trying to fix that, so a qdap-based solution won't work either, even though it seems reasonably easy.

library(tm)

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))

my_words <- c("contract", "prices", "contract prices", "diamond", "shamrock", "diamond shamrock")

dtm <- DocumentTermMatrix(crude, control=list(dictionary = my_words))

# create data.frame from documenttermmatrix
df1 <- data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm), row.names = NULL)
head(df1)

As you can see, the output returns "contract.prices" instead of "contract prices", so I'm looking for a simple solution to this. File 127 includes the phrase 'contract prices', so the table should record at least one instance of it.
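(As an aside, the dots come from data.frame itself: with the default check.names = TRUE it runs make.names() on the column names, which replaces spaces with dots. Passing check.names = FALSE keeps the original names, although that alone does not make the multi-word dictionary entries match.)

# data.frame munges spaces in column names unless told not to
make.names("contract prices")
# [1] "contract.prices"
df1 <- data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm),
                  row.names = NULL, check.names = FALSE)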

I'm also happy to share my actual data, but I'm not sure how to save a small portion of it (it's gigantic), so for now I'm using the 'crude' data as a substitute.

Here is a way to get what you want using the tm package together with RWeka. You need to create a separate tokenizer function that you plug into the DocumentTermMatrix function. RWeka plays very nicely with tm for this.

If you don't want to install RWeka because of its Java dependencies, you can use another package such as tidytext or quanteda. If you need speed because of the size of your data, I advise using the quanteda package (example below the tm code). The quanteda package runs in parallel, and with quanteda_options you can specify how many cores to use (2 is the default), as in the sketch below.
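A minimal sketch (4 threads here is an arbitrary choice; set whatever your machine allows):

library(quanteda)
# raise the number of threads quanteda may use (the package default is 2)
quanteda_options(threads = 4)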

Note that the unigrams and bigrams in your dictionary overlap: in document 127, "prices" (count 3) and "contract prices" (count 1) will double-count "prices".

library(tm)
library(RWeka)

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))

my_words <- c("contract", "prices", "contract prices", "diamond", "shamrock", "diamond shamrock")


# adjust to min = 2 and max = 3 for 2 and 3 word ngrams
RWeka_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 2)) 
}

dtm <- DocumentTermMatrix(crude, control=list(tokenize = RWeka_tokenizer,
                                              dictionary = my_words))

# create data.frame from documenttermmatrix
df1 <- data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm), row.names = NULL, check.names = FALSE)
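If the double counting mentioned in the note above matters for your analysis, one option (a sketch, not the only way) is to subtract each bigram column from its component unigrams after the fact:

head(df1)
# each "contract prices" hit was also tokenized as "contract" and "prices",
# so subtracting the bigram column removes the double count
df1$contract <- df1$contract - df1[["contract prices"]]
df1$prices   <- df1$prices   - df1[["contract prices"]]
# and likewise for "diamond"/"shamrock" and "diamond shamrock"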

If you have a big corpus, quanteda might be faster:

library(quanteda)

corp_crude <- corpus(crude)
# adjust ngrams to 2:3 for 2 and 3 word ngrams
toks_crude <- tokens(corp_crude, ngrams = 1:2, concatenator = " ")
toks_crude <- tokens_keep(toks_crude, pattern = dictionary(list(words = my_words)), valuetype = "fixed")
dfm_crude <- dfm(toks_crude)
df1 <- convert(dfm_crude, to = "data.frame")
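One caveat: the ngrams argument to tokens() belongs to the quanteda 1.x API used above. In current quanteda (v3 and later) the n-grams are built in a separate step with tokens_ngrams(); a rough equivalent, assuming a recent version, looks like this:

library(quanteda)

corp_crude <- corpus(crude)
# quanteda 3.x: build unigrams and bigrams in a separate step
toks_crude <- tokens_ngrams(tokens(corp_crude), n = 1:2, concatenator = " ")
toks_crude <- tokens_keep(toks_crude, pattern = dictionary(list(words = my_words)),
                          valuetype = "fixed")
df1 <- convert(dfm(toks_crude), to = "data.frame")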

You can work with sequences of tokens in quanteda if you first wrap your multi-word patterns in the phrase() function.

library("quanteda")
#> Package version: 1.5.1

data("crude", package = "tm")
data_corpus_crude <- corpus(crude)

my_words <- c("diamond", "contract prices", "diamond shamrock")

You could extract these using kwic(), for instance.

kwic(data_corpus_crude, pattern = phrase(my_words))
#>    [127, 1:1]                             |     Diamond      | Shamrock Corp said that effective
#>    [127, 1:2]                             | Diamond Shamrock | Corp said that effective today
#>  [127, 12:13]        today it had cut its | contract prices  | for crude oil by 1.50
#>  [127, 71:71] a company spokeswoman said. |     Diamond      | is the latest in a

Or, to make them permanently into "compounded" tokens, use tokens_compound() (shown here in a simple example).

tokens("The diamond mining company is called Diamond Shamrock.") %>%
    tokens_compound(pattern = phrase(my_words))
#> tokens from 1 document.
#> text1 :
#> [1] "The"              "diamond"          "mining"          
#> [4] "company"          "is"               "called"          
#> [7] "Diamond_Shamrock" "."
