
How to load multiple JSON files into a quanteda corpus using readtext?

I'm trying to load a large number of JSON files from a news website into a quanteda corpus using readtext. To simplify the process, the JSON files are all in the working directory, but I have also tried keeping them in their own directory.

  1. When I use c() to create a variable that explicitly names a small subset of files, readtext works as hoped and corpus() creates a proper corpus.
  2. When I use list.files() to create a variable listing all 1,500+ JSON files, readtext does not work as hoped: errors are returned and no corpus is created.

I tried to inspect the results of the two methods of defining the set of texts (i.e. c() and list.files()), as well as paste0().

# Load libraries
library(readtext)
library(quanteda)

# Define a set of texts explicitly
a <- c("border_2020_05_10__1589150513.json","border_2020_05_10__1589143358.json","border_2020_05_07__1589170960.json")

# This produces a corpus
extracted_texts <- readtext(a, text_field = "maintext")
my_corpus <- corpus(extracted_texts)
# Define a set of all texts in working directory
b <- list.files(pattern = "*.json", full.names = F)

# This, which I hope to use, produces an error
extracted_texts <- readtext(b, text_field = "maintext")
my_corpus <- corpus(extracted_texts)

The error produced by extracted_texts <- readtext(b, text_field = "maintext") is as follows:

File doesn't contain a single valid JSON object.
Error: This JSON file format is not supported.

This is perplexing because the same files called with a do not produce an error. I validated several of the JSON files, and in every case the validator returned VALID (RFC 8259), the IETF standard for JSON.

Inspecting the differences between a and b:

  • typeof() returns "character" for both a and b.
  • is.vector() and is.atomic() return TRUE for both.
  • is.list() returns FALSE for both.
  • They look similar in RStudio and when printed in the console.

I'm really confused about why a works and b does not.
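The type checks above only describe the containers, not what is in them. A minimal sketch of a content-level comparison follows, using base R only. The helper name compare_file_sets is hypothetical (it is not part of readtext or quanteda), and the sketch assumes both vectors hold paths relative to the same working directory.

```r
# Hypothetical helper: compare two character vectors of file names by content.
compare_file_sets <- function(a, b) {
  list(
    a_in_b          = all(a %in% b),        # is every file in `a` also in `b`?
    only_in_b       = setdiff(b, a),        # files that only the larger set reads
    missing_on_disk = b[!file.exists(b)]    # entries of `b` that do not resolve to a file
  )
}

# Usage with the vectors from the question:
# str(compare_file_sets(a, b))
```

If a_in_b is TRUE, the two calls differ only in the extra files listed in only_in_b, so any file that readtext cannot parse is likely to be among those.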

Lastly, attempting to exactly mimic the procedure used in the readtext documentation, I also tried the following:

# XXXX = my username
data_dir <- file.path("C:/Users/XXXX/Documents/R/")

d <- readtext(paste0(data_dir, "/corpus_linguistics/*.json"), text_field = "maintext")

This also returned the error:

File doesn't contain a single valid JSON object.
Error: This JSON file format is not supported.

At this point I'm stumped. Thanks in advance for any insight on how to move forward.

Solution and Summary

  1. Unclean data: A few of the input JSON files have a null maintext field. These are not useful for analysis and should be removed. All of the files contain a JSON field called "title_rss" that is null. This can be eliminated through a directory-level find and replace with Notepad++, or probably with R or Python, though I still lack the skills for this. Additionally, the files were not in UTF-8 encoding; that was resolved with Codepage Converter.
  2. Method to call the directory string: The list.files() method is used in the readtext How to Use documentation and in several third-party tutorials. This method works with *.txt files, but for some reason it does not seem to work with these particular JSON files. Once the JSON files are properly cleaned and encoded, the method below works without errors. If the data_dir is wrapped in a list.files() call, it produces the following error: Error in list_files(file, ignore_missing, TRUE, verbosity): File '' does not exist. I'm not sure why that is, but leaving it out works for these JSON files.
# Load libraries
library(readtext)
library(quanteda)

# Define a set of texts explicitly
data_dir <- "C:/Users/Nathan/Documents/R/corpus_linguistics/"
extracted_texts <- readtext(paste0(data_dir, "texts_unmodified/*.json"), text_field = "maintext", verbosity = 3)
my_corpus <- corpus(extracted_texts)
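The encoding step from item 1 could also be done in R instead of an external converter. The following is a rough sketch, not the method actually used above: convert_to_utf8 is a hypothetical helper name, the source encoding is assumed to be Windows-1252 (named "CP1252" here, which most iconv builds accept), and the whole file is assumed to fit in memory.

```r
# Hypothetical helper: re-encode one file from Windows-1252 to UTF-8.
convert_to_utf8 <- function(in_file, out_file, from = "CP1252") {
  # Read the raw bytes so the result does not depend on the session locale.
  bytes <- readBin(in_file, what = "raw", n = file.size(in_file))
  converted <- iconv(rawToChar(bytes), from = from, to = "UTF-8")
  if (is.na(converted)) {
    stop("Could not convert ", in_file, " from ", from)
  }
  # Write the UTF-8 bytes back out unchanged.
  writeBin(charToRaw(converted), out_file)
  invisible(out_file)
}

# Usage (paths are placeholders): convert every JSON file in one directory
# into a separate, already existing output directory.
# for (f in list.files("texts_unmodified", pattern = "\\.json$", full.names = TRUE)) {
#   convert_to_utf8(f, file.path("texts_utf8", basename(f)))
# }
```

Writing to a separate directory keeps the unmodified originals available for comparison.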

Test with unmodified files, one known to have empty fields

Input: 5 files, of which 4 have a non-empty text_field and 1 has a null text field. In addition, all of the files use Western European (Windows) 1252 encoding.

Errors:

Reading texts from C:/Users/Nathan/Documents/R/corpus_linguistics/texts_unmodified/*.json
, using glob pattern
 ... reading (json) file: C:/Users/Nathan/Documents/R/corpus_linguistics/texts_unmodified/border_2014_02_17__1589147645.json
File doesn't contain a single valid JSON object.
 contain a single valid JSON object.
 ... reading (json) file: C:/Users/Nathan/Documents/R/corpus_linguistics/texts_unmodified/border_2014_03_13__1589150325.json
File doesn't contain a single valid JSON object.
Column 14 ['maintext'] of item 1 is length 0. This (and 0 others like it) has been filled with NA (NULL for list columns) to make each item uniform. ... read 5 documents.

Result: a properly formed corpus consisting of 5 documents. One document has no tokens or types. The corpus seems to build properly despite the errors. Some special characters may not display properly because of the encoding issue, but I was not able to check this.

Test with cleaned files known to have no empty fields

Input files: 4 files that have no empty or null JSON fields. In all cases, text_field contains text and the title_rss field was removed. Each of the files was converted from Western European (Windows) 1252 to Unicode (UTF-8, code page 65001).

Errors: NONE!

Result: A properly formed corpus.

Many thanks to the two developers for detailed feedback and useful leads. The assistance is deeply appreciated.

There are a few possibilities here, but the most likely are:

  1. One of your files has a malformed JSON structure, from the point of view of readtext(). Even though a file might be acceptable as strict JSON, if one of your text fields is empty, for instance, then this will cause the error. (See below for a demonstration and a solution.)

  2. While readtext() can take a "glob" pattern match, list.files() takes a regular expression. It's possible (but unlikely) that list.files(pattern = "*.json", ...) is picking up something you don't want. But list.files() should not be necessary with readtext(); see below.
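The glob-versus-regex distinction in point 2 can be illustrated with base R's utils::glob2rx(), which translates a glob into the equivalent anchored regular expression. This is only an illustrative sketch: the file names are made up, and the unanchored pattern ".json" stands in for any regex that lacks anchors and an escaped dot.

```r
# Made-up file names: one real JSON file and two near misses.
files <- c("a.json", "a.json.bak", "myjson.txt")

# The glob "*.json" translated into an anchored regular expression.
glob_as_regex <- utils::glob2rx("*.json")
print(glob_as_regex)

# Anchored pattern: only names that end in ".json" match.
strict <- grepl(glob_as_regex, files)

# Unanchored pattern: "." matches any character, so all three names match.
loose <- grepl(".json", files)

print(data.frame(files, strict, loose))
```

Passing the result of glob2rx() (or a hand-written pattern such as "\\.json$") to list.files() keeps the regex behaviour in line with what the glob was meant to select.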

To demonstrate, let's write out each document in data_corpus_inaugural as a separate JSON file, and then read them in using readtext().

library("quanteda", warn.conflicts = FALSE)
## Package version: 2.0.1
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.

tmpdir <- tempdir()
corpdf <- convert(data_corpus_inaugural, to = "data.frame")
for (d in corpdf$doc_id) {
  cat(jsonlite::toJSON(dplyr::filter(corpdf, doc_id == d)),
    file = paste0(tmpdir, "/", d, ".json")
  )
}

head(list.files(tmpdir))
## [1] "1789-Washington.json" "1793-Washington.json" "1797-Adams.json"     
## [4] "1801-Jefferson.json"  "1805-Jefferson.json"  "1809-Madison.json"

To read them in, you can use the "glob" pattern match here and just read the JSON files.

rt <- readtext::readtext(paste0(tmpdir, "/*.json"),
  text_field = "text", docid_field = "doc_id"
)
summary(corpus(rt), n = 5)
## Corpus consisting of 58 documents, showing 5 documents:
## 
##                  Text Types Tokens Sentences Year  President FirstName
##  1789-Washington.json   625   1537        23 1789 Washington    George
##  1793-Washington.json    96    147         4 1793 Washington    George
##       1797-Adams.json   826   2577        37 1797      Adams      John
##   1801-Jefferson.json   717   1923        41 1801  Jefferson    Thomas
##   1805-Jefferson.json   804   2380        45 1805  Jefferson    Thomas
##                  Party
##                   none
##                   none
##             Federalist
##  Democratic-Republican
##  Democratic-Republican

So that all worked fine.

But if we add one file whose text field is empty, then this produces the error in question:

cat('[ { "doc_id" : "d1", "text" : "this is a file" },
       { "doc_id" : "d2", "text" :  } ]',
  file = paste0(tmpdir, "/badfile.json")
)
rt <- readtext::readtext(paste0(tmpdir, "/*.json"),
  text_field = "text", docid_field = "doc_id"
)
## File doesn't contain a single valid JSON object.
## Error: This JSON file format is not supported.

True, that was not a valid JSON file, since it contained a key with no value. But I suspect you have something like that in one of your files.

Here's how you can identify the problem: loop through your b (the one from the question, not the one I've specified below).

b <- tail(list.files(tmpdir, pattern = ".*\\.json", full.names = TRUE))
for (f in b) {
  cat("Reading:", f, "\n")
  rt <- readtext::readtext(f, text_field = "text", docid_field = "doc_id")
}
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2001-Bush.json 
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2005-Bush.json 
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2009-Obama.json 
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2013-Obama.json 
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2017-Trump.json 
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/badfile.json 
## File doesn't contain a single valid JSON object.
## Error: This JSON file format is not supported.
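The loop above stops at the first file that fails. With 1,500+ files it may be more convenient to collect every failing file in one pass. The sketch below does that with tryCatch(); find_failing_files is a hypothetical helper name, and the reader argument is whatever function you use to read a single file (a readtext() call is shown in the commented usage, assuming readtext is installed and the field names match your data).

```r
# Hypothetical helper: return the files for which `reader` raises an error.
find_failing_files <- function(files, reader) {
  failed <- character(0)
  for (f in files) {
    # TRUE if the file was read, FALSE if the reader raised an error.
    ok <- tryCatch({
      reader(f)
      TRUE
    }, error = function(e) FALSE)
    if (!ok) failed <- c(failed, f)
  }
  failed
}

# Usage with the `b` from the question:
# bad <- find_failing_files(
#   b,
#   function(f) readtext::readtext(f, text_field = "maintext")
# )
# good <- setdiff(b, bad)
```

The files in bad can then be inspected or repaired by hand, and good can be passed to readtext() directly.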
