
Using tm() to mine PDFs for two and three word phrases

I'm trying to mine a set of PDFs for specific two and three word phrases. I know this question has been asked under various circumstances before.

This solution partly works. However, the list does not return strings containing more than one word.

I've tried the solutions offered in other threads, here and here, for example (as well as many others). Unfortunately nothing works.

Also, the qdap library won't load, and I wasted an hour trying to solve that problem, so that solution won't work either, even though it seems reasonably easy.

library(tm)

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))

my_words <- c("contract", "prices", "contract prices", "diamond", "shamrock", "diamond shamrock")

dtm <- DocumentTermMatrix(crude, control=list(dictionary = my_words))

# create data.frame from documenttermmatrix
df1 <- data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm), row.names = NULL)
head(df1)

As you can see, the output returns "contract.prices" instead of "contract prices", so I'm looking for a simple solution to this. File 127 includes the phrase "contract prices", so the table should record at least one instance of it.
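The dots come from R's default column-name munging rather than from tm itself: data.frame() passes names through make.names(), which replaces spaces with dots. A minimal sketch of this behavior and the check.names = FALSE workaround (the column name here is just an illustration):

```r
# data.frame() munges syntactically invalid column names by default
df_a <- data.frame(`contract prices` = 1)
names(df_a)  # "contract.prices"

# check.names = FALSE keeps the original column names intact
df_b <- data.frame(`contract prices` = 1, check.names = FALSE)
names(df_b)  # "contract prices"
```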

I'm also happy to share my actual data, but I'm not sure how to save a small portion of it (it's gigantic). So for now I'm using the 'crude' data as a substitute.

Here is a way to get what you want using the tm package together with RWeka. You need to create a separate tokenizer function that you plug into the DocumentTermMatrix function. RWeka plays very nicely with tm for this.

If you don't want to install RWeka because of its Java dependency, you can use another package such as tidytext or quanteda. If you need speed because of the size of your data, I advise using the quanteda package (example below the tm code). Quanteda runs in parallel, and with quanteda_options you can specify how many cores you want to use (2 cores is the default).
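As a small illustration of that option (assuming a quanteda version that supports the threads setting; the core count of 4 is just an example, adjust it to your machine):

```r
library(quanteda)

# assumption: the machine has at least 4 cores available
quanteda_options(threads = 4)

# query the current setting
quanteda_options("threads")
```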

Note: the unigrams and bigrams in your dictionary overlap. In the example used, you will see that in text 127 "prices" (3) and "contract prices" (1) will double count the occurrences of "prices".
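The double counting can be seen without any tokenizer at all: a unigram count of "prices" also includes the occurrence sitting inside the bigram "contract prices". A base-R sketch (the sample string and helper function are made up for illustration):

```r
txt <- "contract prices and spot prices"

# count non-overlapping literal matches of a pattern in a string
count_matches <- function(pattern, x) {
  length(gregexpr(pattern, x, fixed = TRUE)[[1]])
}

count_matches("prices", txt)           # 2 -- includes the one inside "contract prices"
count_matches("contract prices", txt)  # 1
```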

library(tm)
library(RWeka)

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))

my_words <- c("contract", "prices", "contract prices", "diamond", "shamrock", "diamond shamrock")


# adjust to min = 2 and max = 3 for 2 and 3 word ngrams
RWeka_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 2)) 
}

dtm <- DocumentTermMatrix(crude, control=list(tokenize = RWeka_tokenizer,
                                              dictionary = my_words))

# create data.frame from documenttermmatrix
df1 <- data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm), row.names = NULL, check.names = FALSE)

For speed, if you have a big corpus, quanteda might be better:

library(quanteda)

corp_crude <- corpus(crude)
# adjust ngrams to 2:3 for 2 and 3 word ngrams
toks_crude <- tokens(corp_crude, ngrams = 1:2, concatenator = " ")
toks_crude <- tokens_keep(toks_crude, pattern = dictionary(list(words = my_words)), valuetype = "fixed")
dfm_crude <- dfm(toks_crude)
df1 <- convert(dfm_crude, to = "data.frame")

You can work with sequences of tokens in quanteda if you first wrap your multi-word patterns in the phrase() function.

library("quanteda")
#> Package version: 1.5.1

data("crude", package = "tm")
data_corpus_crude <- corpus(crude)

my_words <- c("diamond", "contract prices", "diamond shamrock")

You could extract these using kwic(), for instance.

kwic(data_corpus_crude, pattern = phrase(my_words))
#>                                                               
#>    [127, 1:1]                             |     Diamond      |
#>    [127, 1:2]                             | Diamond Shamrock |
#>  [127, 12:13]        today it had cut its | contract prices  |
#>  [127, 71:71] a company spokeswoman said. |     Diamond      |
#>                                   
#>  Shamrock Corp said that effective
#>  Corp said that effective today   
#>  for crude oil by 1.50            
#>  is the latest in a

Or, to make them permanently into "compounded" tokens, use tokens_compound() (shown here in a simple example).

tokens("The diamond mining company is called Diamond Shamrock.") %>%
    tokens_compound(pattern = phrase(my_words))
#> tokens from 1 document.
#> text1 :
#> [1] "The"              "diamond"          "mining"          
#> [4] "company"          "is"               "called"          
#> [7] "Diamond_Shamrock" "."
