
R: having trouble using quanteda corpus with readtext

After reading my corpus with the quanteda package, I get the same error when using various subsequent statements:

Error in UseMethod("texts") : no applicable method for 'texts' applied to an object of class "c('corpus_frame', 'data.frame')"

For example, when using this simple statement: texts(mycorpus)[2]. My actual goal is to create a dfm (which gives me the same error message as above).

I read the corpus with this code:

mycorpus <- corpus_frame(readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
    docvarsfrom = "filenames", dvsep = "_",
    docvarnames = c("Date of Publication", "Length LexisNexis"),
    encoding = "UTF-8-BOM"))

My dataset consists of 50 newspaper articles, including some metadata such as the date of publication.

See the screenshot of the corpus.

Why am I getting this error every time? Thanks very much in advance for your help!

Response 1:

When using just readtext(), I get one step further, and texts(text.corpus)[1] does not yield an error.

However, the same error occurs again when tokenizing:

token <- tokenize(text.corpus, removePunct = TRUE, removeNumbers = TRUE, ngrams = 1:2)
tokens(text.corpus)

Yields:

Error in UseMethod("tokenize") : no applicable method for 'tokenize' applied to an object of class "c('readtext', 'data.frame')"

Error in UseMethod("tokens") : no applicable method for 'tokens' applied to an object of class "c('readtext', 'data.frame')"

Response 2:

Now I get the following error and warning in return, which I initially also got; that is why I started using corpus_frame():

Error in UseMethod("tokens") : no applicable method for 'tokens' applied to an object of class "c('corpus_frame', 'data.frame')"

In addition: Warning message: 'corpus' is deprecated. Use 'corpus_frame' instead. See help("Deprecated")

Do I need to specify that tokenization (or any other step) should be applied only to the 'text' column and not to the entire dataset?

Response 3:

Thank you, Patrick, this does clarify things and brought me somewhat further. When running this:

# Quanteda - corpus way
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
         docvarsfrom = "filenames", dvsep = "_", 
         docvarnames = c("Date of Publication", "Length LexisNexis", "source"), 
         encoding = "UTF-8-BOM")  %>%
  corpus() %>%
  tokens(removePunct = TRUE, removeNumbers = TRUE, ngrams = 1:2)

I get this:

Error in tokens_internal(texts(x), ...) : the ... list does not contain 3 elements
In addition: Warning messages: 'removePunct' is deprecated; use 'remove_punct' instead. 'removeNumbers' is deprecated; use 'remove_numbers' instead.

So I changed it accordingly (using remove_punct and remove_numbers), and now the code runs well.

Alternatively, I also tried this:

# Corpus - term_matrix way
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
         docvarsfrom = "filenames", dvsep = "_", 
         docvarnames = c("Date of Publication", "Length LexisNexis", "source"), 
         encoding = "UTF-8-BOM")  %>%
  term_matrix(drop_punct = TRUE, drop_numbers = TRUE, ngrams = 1:2)

Which gives this error:

Error in term_matrix(., drop_punct = TRUE, drop_numbers = TRUE, ngrams = 1:2) : unrecognized text filter property: 'drop_numbers'

After removing drop_numbers = TRUE, the matrix is actually produced. Thanks very much for your help!

To clarify the situation:

Version 0.9.1 of the corpus package had a function called corpus. quanteda also has a function called corpus. To avoid the name clash between the two packages, the corpus package's corpus function was deprecated and renamed to corpus_frame in version 0.9.2; it was removed in version 0.9.3.
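You can check which version of corpus is installed with packageVersion() from base R:

packageVersion("corpus")  # 0.9.2 deprecates corpus(); 0.9.3 removes it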

To avoid the name clash with quanteda, either upgrade corpus to the latest version on CRAN (0.9.3), or else do

library(corpus)
library(quanteda)

instead of the other order.
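Alternatively, a minimal sketch that sidesteps the load order entirely by qualifying the call with the package namespace (the two example documents here are made up):

library(corpus)
library(quanteda)

# :: selects quanteda's corpus() explicitly, regardless of which package
# was attached last and therefore masks the other
crp <- quanteda::corpus(c(doc1 = "First example text.",
                          doc2 = "Second example text."))
summary(crp)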


Now, if you want to use quanteda to tokenize your texts, follow the advice given in Ken's answer:

readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
     docvarsfrom = "filenames", dvsep = "_", 
     docvarnames = c("Date of Publication", "Length LexisNexis"), 
     encoding = "UTF-8-BOM") %>%
    corpus() %>%
    tokens(remove_punct = TRUE, remove_numbers = TRUE, ngrams = 1:2)

You may want to use the dfm function instead of the tokens function if your goal is to get a document-by-term count matrix.
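For example, a minimal sketch of that route, assuming my_corpus holds the corpus built above (note that recent quanteda releases form ngrams with tokens_ngrams() rather than an ngrams argument to tokens()):

my_tokens <- tokens(my_corpus, remove_punct = TRUE, remove_numbers = TRUE)
my_dfm <- dfm(tokens_ngrams(my_tokens, n = 1:2))  # document-by-term counts
topfeatures(my_dfm, 10)                           # ten most frequent features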

If you want to use the corpus package, instead do

readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
     docvarsfrom = "filenames", dvsep = "_", 
     docvarnames = c("Date of Publication", "Length LexisNexis"), 
     encoding = "UTF-8-BOM") %>%
    term_matrix(drop_punct = TRUE, drop_number = TRUE, ngrams = 1:2)

Depending on what you're trying to do, you might want to use the term_stats function instead of the term_matrix function.
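For instance, a minimal sketch of term_stats() with the same filter properties as the term_matrix() call above (the sample text is made up):

library(corpus)
# per-term counts and document supports, rather than a full matrix
term_stats("One small example text, with one repeated word: example.",
           drop_punct = TRUE, ngrams = 1:2)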

OK, you are getting this error because (as the error message states) there is no tokens() method for a readtext object class, which is a special version of a data.frame. (Note: tokenize() is older, deprecated syntax that will be removed in the next version; use tokens() instead.)

You want this:

library("quanteda")
library("readtext")
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
         docvarsfrom = "filenames", dvsep = "_", 
         docvarnames = c("Date of Publication", "Length LexisNexis"), 
         encoding = "UTF-8-BOM") %>%
    corpus() %>%
    tokens(remove_punct = TRUE, remove_numbers = TRUE, ngrams = 1:2)

It's the corpus() step you omitted. corpus_frame() is from a different package (my friend Patrick Perry's corpus).
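To see the dispatch problem directly, a minimal check (reusing the asker's readtext call):

library("readtext")
library("quanteda")
x <- readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt")
class(x)          # "readtext" "data.frame": no tokens() method for these classes
crp <- corpus(x)  # converting first gives tokens() a class it has a method for
class(crp)        # now a quanteda corpus, which tokens() accepts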
