
R: having trouble using quanteda corpus with readtext

After reading in my corpus for use with the quanteda package, I get the same error from various subsequent statements:

Error in UseMethod("texts") : no applicable method for 'texts' applied to an object of class "c('corpus_frame', 'data.frame')").

For example, it happens with this simple statement: texts(mycorpus)[2]. My actual goal is to create a dfm (which gives me the same error message as above).

I read the corpus with this code:

mycorpus <- corpus_frame(readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
    docvarsfrom = "filenames", dvsep = "_",
    docvarnames = c("Date of Publication", "Length LexisNexis"),
    encoding = "UTF-8-BOM"))

My dataset consists of 50 newspaper articles, including some metadata such as the date of publication.

See the screenshot of the corpus.

Why am I getting this error every time? Thanks very much in advance for your help!

Response 1:

When using just readtext() I get one step further and texts(text.corpus)[1] does not yield an error.

However, when tokenizing, the same kind of error occurs again. So this:

token <- tokenize(text.corpus, removePunct = TRUE, removeNumbers = TRUE, ngrams = 1:2)
tokens(text.corpus)

Yields:

Error in UseMethod("tokenize") : no applicable method for 'tokenize' applied to an object of class "c('readtext', 'data.frame')"

Error in UseMethod("tokens") : no applicable method for 'tokens' applied to an object of class "c('readtext', 'data.frame')"

Response 2:

Now I get the following error and warning in return, which I initially also got; that is why I started using corpus_frame() in the first place:

Error in UseMethod("tokens") : no applicable method for 'tokens' applied to an object of class "c('corpus_frame', 'data.frame')"

In addition: Warning message: 'corpus' is deprecated. Use 'corpus_frame' instead. See help("Deprecated")

Do I need to specify that 'tokenization' or any other step is only applied to the 'text' column and not to the entire dataset?

Response 3:

Thank you, Patrick, this clarifies things and got me somewhat further. When running this:

# Quanteda - corpus way
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
         docvarsfrom = "filenames", dvsep = "_", 
         docvarnames = c("Date of Publication", "Length LexisNexis", "source"), 
         encoding = "UTF-8-BOM")  %>%
  corpus() %>%
  tokens(removePunct = TRUE, removeNumbers = TRUE, ngrams = 1:2)

I get this:

Error in tokens_internal(texts(x), ...) : the ... list does not contain 3 elements
In addition: Warning message: removePunct, removeNumbers is deprecated; use remove_punct, remove_numbers instead

So I changed it accordingly (using remove_punct and remove_numbers) and now the code runs well.

Alternatively, I also tried this:

# Corpus - term_matrix way
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
         docvarsfrom = "filenames", dvsep = "_", 
         docvarnames = c("Date of Publication", "Length LexisNexis", "source"), 
         encoding = "UTF-8-BOM")  %>%
  term_matrix(drop_punct = TRUE, drop_numbers = TRUE, ngrams = 1:2)

Which gives this error:

Error in term_matrix(., drop_punct = TRUE, drop_numbers = TRUE, ngrams = 1:2) : unrecognized text filter property: 'drop_numbers'

After removing drop_numbers = TRUE, the matrix is actually produced. Thanks very much for your help!

To clarify the situation:

Version 0.9.1 of the corpus package had a function called corpus. quanteda also has a function called corpus. To avoid the name clash between the two packages, the corpus package's corpus function was deprecated and renamed to corpus_frame in version 0.9.2; it was removed in version 0.9.3.

To avoid the name clash with quanteda, either upgrade corpus to the latest version on CRAN (0.9.3), or else do

library(corpus)
library(quanteda)

instead of the other order.
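If you prefer not to depend on the attach order at all, a minimal sketch is to qualify the call explicitly (the object name rt is only illustrative; the path and docvar names are the ones from the question):

library(readtext)

rt <- readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
               docvarsfrom = "filenames", dvsep = "_",
               docvarnames = c("Date of Publication", "Length LexisNexis"),
               encoding = "UTF-8-BOM")

mycorpus <- quanteda::corpus(rt)  # always quanteda's corpus(), regardless of load order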


Now, if you want to use quanteda to tokenize your texts, follow the advice given in Ken's answer:

readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
     docvarsfrom = "filenames", dvsep = "_", 
     docvarnames = c("Date of Publication", "Length LexisNexis"), 
     encoding = "UTF-8-BOM"))  %>%
    corpus() %>%
    tokens(remove_punct = TRUE, remove_numbers = TRUE, ngrams = 1:2)

You may want to use the dfm function instead of the tokens function if your goal is to get a document-by-term count matrix.
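Since creating a dfm is the stated goal, here is a minimal sketch of that variant, using the same path and docvar names as above and the argument names that work in the quanteda version used in this thread (library(magrittr) is included only to supply the %>% pipe if it is not already attached):

library("quanteda")
library("readtext")
library("magrittr")  # for the %>% pipe

readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
     docvarsfrom = "filenames", dvsep = "_", 
     docvarnames = c("Date of Publication", "Length LexisNexis"), 
     encoding = "UTF-8-BOM")  %>%
    corpus() %>%
    tokens(remove_punct = TRUE, remove_numbers = TRUE, ngrams = 1:2) %>%
    dfm()   # document-by-term count matrix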

If you want to use the corpus package instead, do

readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
     docvarsfrom = "filenames", dvsep = "_", 
     docvarnames = c("Date of Publication", "Length LexisNexis"), 
     encoding = "UTF-8-BOM"))  %>%
    term_matrix(drop_punct = TRUE, drop_number = TRUE, ngrams = 1:2)

Depending on what you're trying to do, you might want to use the term_stats function instead of the term_matrix function.
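For example, here is a minimal term_stats() sketch, assuming the filter properties and the ngrams argument are passed the same way as for term_matrix() (library(magrittr) again only supplies the %>% pipe):

library("readtext")
library("corpus")
library("magrittr")  # for the %>% pipe

readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
     docvarsfrom = "filenames", dvsep = "_", 
     docvarnames = c("Date of Publication", "Length LexisNexis"), 
     encoding = "UTF-8-BOM")  %>%
    term_stats(drop_punct = TRUE, drop_number = TRUE, ngrams = 1:2)  # per-term counts rather than a document-by-term matrix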

OK, you are getting this error because (as the error message states) there is no tokens() method for an object of class readtext, which is a special kind of data.frame. (Note: tokenize() is older, deprecated syntax that will be removed in the next version; use tokens() instead.)

You want this:

library("quanteda")
library("readtext")
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
         docvarsfrom = "filenames", dvsep = "_", 
         docvarnames = c("Date of Publication", "Length LexisNexis"), 
         encoding = "UTF-8-BOM"))  %>%
    corpus() %>%
    tokens(remove_punct = TRUE, remove_numbers = TRUE, ngrams = 1:2)

It's the corpus() step you omitted. corpus_frame() is from a different package (my friend Patrick Perry's corpus).
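To make the dispatch issue visible, here is a small sketch; rt stands for the readtext object returned by the readtext() call above (the name itself is hypothetical):

class(rt)               # "readtext" "data.frame"  -> no texts()/tokens() method applies
mycorpus <- corpus(rt)  # the conversion step that was missing
class(mycorpus)         # a quanteda corpus -> texts(), tokens(), and dfm() now dispatch correctly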
