R: Problems using a quanteda corpus with readtext

Question

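After reading in my corpus with the quanteda package, I get the same error from several subsequent statements:

Error in UseMethod("texts") : no applicable method for 'texts' applied to an object of class "c('corpus_frame', 'data.frame')"

For example, it appears with this simple statement: texts(mycorpus)[2]. My actual goal is to create a dfm, which gives me the same error message.

I read in the corpus with this code: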

`mycorpus <- corpus_frame(readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt", 
docvarsfrom="filenames", dvsep="_", docvarnames=c("Date of Publication", 
"Length LexisNexis"), encoding = "UTF-8-BOM"))`

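My dataset consists of 50 newspaper articles, including some metadata such as the date of publication.

See the screenshot.

Why do I get this error every time? Thanks very much for your help!

Response 1:

Using only readtext() gets me one step further: texts(text.corpus)[1] does not produce an error.

However, the same error comes up again when tokenizing, so: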

token <- tokenize(text.corpus, removePunct=TRUE, removeNumbers=TRUE, ngrams = 1:2)
tokens(text.corpus)

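Yields:

Error in UseMethod("tokenize") : no applicable method for 'tokenize' applied to an object of class "c('readtext', 'data.frame')"

Error in UseMethod("tokens") : no applicable method for 'tokens' applied to an object of class "c('readtext', 'data.frame')"

Response 2:

Now I get these two error messages, which I also got initially, which is why I started using corpus_frame():

Error in UseMethod("tokens") : no applicable method for 'tokens' applied to an object of class "c('corpus_frame', 'data.frame')"

In addition: Warning message: 'corpus' is deprecated. Use 'corpus_frame' instead. See help("Deprecated")

Do I need to specify that tokenization, or any other step, is applied only to the 'text' column and not to the entire dataset?

Response 3:

Thank you, Patrick, this does clarify things and got me one step further. When running: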

# Quanteda - corpus way
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
         docvarsfrom = "filenames", dvsep = "_", 
         docvarnames = c("Date of Publication", "Length LexisNexis", "source"), 
         encoding = "UTF-8-BOM")  %>%
  corpus() %>%
  tokens(removePunct = TRUE, removeNumbers = TRUE, ngrams = 1:2)

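I get:

Error in token_internal(texts(x), ...) : the ... list does not contain 3 elements
In addition: Warning messages: removePunct / removeNumbers is deprecated; use remove_punct / remove_numbers instead

So I changed it accordingly (using remove_punct and remove_numbers), and now the code runs well.

Alternatively, I also tried this: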

# Corpus - term_matrix way
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
         docvarsfrom = "filenames", dvsep = "_", 
         docvarnames = c("Date of Publication", "Length LexisNexis", "source"), 
         encoding = "UTF-8-BOM")  %>%
  term_matrix(drop_punct = TRUE, drop_numbers = TRUE, ngrams = 1:2)

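This gives the following error:

Error in term_matrix(., drop_punct = TRUE, drop_numbers = TRUE, ngrams = 1:2) : unrecognized text filter property: 'drop_numbers'

After removing drop_numbers = TRUE, the matrix is actually produced. Thanks very much for your help!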

Answer 1

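To clarify the situation:

Version 0.9.1 of the corpus package had a function called corpus. quanteda also has a function called corpus. To avoid the name clash between the two packages, the corpus package's corpus function was deprecated and renamed to corpus_frame in version 0.9.2; it was removed in version 0.9.3.

To avoid the name clash with quanteda, either upgrade corpus to the latest version on CRAN (0.9.3), or else run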

library(corpus)
library(quanteda)

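rather than in the other order (the package attached last masks same-named functions from the one attached earlier).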
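The effect of load order can be imitated without installing either package. The sketch below uses base R only: `pkg_a` and `pkg_b` are hypothetical stand-ins for two packages that both export a function called `corpus`, and `attach()` plays the role of `library()`. Whichever one is attached last is searched first.

```r
# Hypothetical stand-ins for two packages that both export `corpus`.
pkg_a <- list(corpus = function(x) "corpus() from pkg_a")
pkg_b <- list(corpus = function(x) "corpus() from pkg_b")

# attach() inserts at position 2 of the search path, like library(),
# so the database attached last is searched first.
attach(pkg_a, name = "pkg_a", warn.conflicts = FALSE)
attach(pkg_b, name = "pkg_b", warn.conflicts = FALSE)

which_corpus <- corpus("some text")  # resolved to pkg_b's version

# An explicit reference sidesteps the search order, much as pkg::fun does.
explicit_a <- pkg_a$corpus("some text")

# Clean up the search path again.
detach("pkg_b", character.only = TRUE)
detach("pkg_a", character.only = TRUE)
```

With the real packages, writing quanteda::corpus(...) is the usual way to be explicit regardless of load order.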

Now, if you want to use quanteda to tokenize your texts, follow the advice given in Ken's answer:

readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
     docvarsfrom = "filenames", dvsep = "_", 
     docvarnames = c("Date of Publication", "Length LexisNexis"), 
     encoding = "UTF-8-BOM")  %>%
    corpus() %>%
    tokens(remove_punct = TRUE, remove_numbers = TRUE, ngrams = 1:2)

If your goal is to get a document-by-term count matrix, you will probably want to use the dfm function instead of the tokens function.
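A minimal sketch of that route, assuming a recent quanteda release in which n-grams are built with tokens_ngrams() instead of an ngrams argument to tokens(). The two sample sentences are made up and stand in for the newspaper files.

```r
library(quanteda)

# Made-up texts standing in for the articles read by readtext().
txt <- c(doc1 = "The quick brown fox jumped over 2 lazy dogs.",
         doc2 = "A lazy dog slept, and the fox ran away!")

# Tokenize, dropping punctuation and numbers.
toks <- tokens(corpus(txt), remove_punct = TRUE, remove_numbers = TRUE)

# Add bigrams alongside the unigrams.
toks <- tokens_ngrams(toks, n = 1:2)

# Document-feature matrix: one row per document, one column per n-gram.
dfmat <- dfm(toks)
```

Printing dfmat, or calling topfeatures(dfmat), is a quick way to check that the counts look plausible.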

If you want to use the corpus package, do this instead:

readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
     docvarsfrom = "filenames", dvsep = "_", 
     docvarnames = c("Date of Publication", "Length LexisNexis"), 
     encoding = "UTF-8-BOM")  %>%
    term_matrix(drop_punct = TRUE, drop_number = TRUE, ngrams = 1:2)

Depending on what you are trying to do, you might want to use the term_stats function instead of the term_matrix function.

Answer 2

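OK, you are getting this error because (as the error message states) there is no tokens() method for the readtext object class, which is a special version of a data.frame. (Note: tokenize() is older, deprecated syntax that will be removed in the next version. Use tokens() instead.)

You want this: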

library("quanteda")
library("readtext")
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
         docvarsfrom = "filenames", dvsep = "_", 
         docvarnames = c("Date of Publication", "Length LexisNexis"), 
         encoding = "UTF-8-BOM")  %>%
    corpus() %>%
    tokens(remove_punct = TRUE, remove_numbers = TRUE, ngrams = 1:2)

It is the corpus() step that you omitted. corpus_frame() is from a different package (my friend Patrick Perry's corpus).
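The dispatch failure can be reproduced in base R. In this sketch, count_words is a hypothetical S3 generic with a method only for a made-up class my_corpus, and as_my_corpus is an invented converter playing the role of corpus(); neither is part of quanteda. Calling the generic on an object whose classes are c("readtext", "data.frame") fails in the same way, and converting the object first lets the method be found.

```r
# Hypothetical generic with a method for one class only.
count_words <- function(x, ...) UseMethod("count_words")
count_words.my_corpus <- function(x, ...) {
  lengths(strsplit(x$text, "[[:space:]]+"))
}

# Invented converter playing the role of corpus().
as_my_corpus <- function(df) {
  structure(list(text = df$text), class = "my_corpus")
}

# A data frame tagged with the same classes as a readtext object.
rt <- structure(data.frame(text = c("one two three", "four five"),
                           stringsAsFactors = FALSE),
                class = c("readtext", "data.frame"))

# No method exists for "readtext" or "data.frame", so dispatch fails;
# the error message is captured instead of stopping the script.
err <- tryCatch(count_words(rt), error = function(e) conditionMessage(e))

# After conversion the my_corpus method is found.
n_words <- count_words(as_my_corpus(rt))
```

The same reasoning applies to tokens() and dfm(): they need an object of a class they have a method for, which is what the corpus() step supplies.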
