How to load multiple JSON files into a quanteda corpus using readtext?
I'm trying to load a large number of JSON files from a news website into a quanteda corpus using readtext. To simplify the process, the JSON files are all in the working directory, but I have also tried them in their own directory.
When I use c() to create a variable that explicitly defines a small subset of files, readtext works as hoped and a corpus is properly created with corpus().
When I use list.files() to create a variable that lists all of the 1,500+ JSON files, readtext does not work as hoped: errors are returned and a corpus is not created.
I tried to inspect the results of the two methods of defining the set of texts (i.e. c() and list.files()) as well as paste0().
# Load libraries
library(readtext)
library(quanteda)
# Define a set of texts explicitly
a <- c("border_2020_05_10__1589150513.json","border_2020_05_10__1589143358.json","border_2020_05_07__1589170960.json")
# This produces a corpus
extracted_texts <- readtext(a, text_field = "maintext")
my_corpus <- corpus(extracted_texts)
# Define a set of all texts in working directory
b <- list.files(pattern = "*.json", full.names = F)
# This, which I hope to use, produces an error
extracted_texts <- readtext(b, text_field = "maintext")
my_corpus <- corpus(extracted_texts)
The error produced by extracted_texts <- readtext(b, text_field = "maintext") is as follows:
File doesn't contain a single valid JSON object.
Error: This JSON file format is not supported.
This is perplexing because the same files called with a do not produce an error. I validated several of the JSON files, and in every case the result was VALID (RFC 8259), the IETF standard for JSON.
Inspecting the differences between a and b:
typeof() returns "character" for both a and b.
is.vector() and is.atomic() return TRUE for both.
is.list() returns FALSE for both.
I'm really confused why a works and b does not.
Lastly, attempting to exactly mimic the procedure used in the readtext documentation, I also tried the following:
# XXXX = my username
data_dir <- file.path("C:/Users/XXXX/Documents/R/")
d <- readtext(paste0(data_dir, "/corpus_linguistics/*.json"), text_field = "maintext")
This also returned the error:
File doesn't contain a single valid JSON object.
Error: This JSON file format is not supported.
At this point I'm stumped. Thanks in advance for any insight on how to move forward.
... main_text field. These are not useful for analysis and should be removed.
... a JSON field "title_rss" that is null. This can be eliminated through a directory-level find and replace with Notepad++, or probably with R or Python, though I still lack the skills for this.
The list.files() method is employed in the readtext How to Use documentation and several third-party tutorials. This method works with *.txt files, but for some reason it does not seem to work with these particular JSON files.
If data_dir is wrapped in a list.files() function, it produces the following error: Error in list_files(file, ignore_missing, TRUE, verbosity): File '' does not exist. I'm not sure why that is, but leaving it out works for these JSON files.
# Load libraries
library(readtext)
library(quanteda)
# Define a set of texts explicitly
data_dir <- "C:/Users/Nathan/Documents/R/corpus_linguistics/"
extracted_texts <- readtext(paste0(data_dir, "texts_unmodified/*.json"), text_field = "maintext", verbosity = 3)
my_corpus <- corpus(extracted_texts)
Input: 5 files, consisting of 4 without an empty or null text_field and 1 file with a null text field. In addition, all of the files have Western European (Windows) 1252 encoding.
Errors:
Reading texts from C:/Users/Nathan/Documents/R/corpus_linguistics/texts_unmodified/*.json, using glob pattern
... reading (json) file: C:/Users/Nathan/Documents/R/corpus_linguistics/texts_unmodified/border_2014_02_17__1589147645.json
File doesn't contain a single valid JSON object.
... reading (json) file: C:/Users/Nathan/Documents/R/corpus_linguistics/texts_unmodified/border_2014_03_13__1589150325.json
File doesn't contain a single valid JSON object.
Column 14 ['maintext'] of item 1 is length 0. This (and 0 others like it) has been filled with NA (NULL for list columns) to make each item uniform.
... read 5 documents.
Result: a properly formed corpus consisting of 5 documents. One document lacks either tokens or types. The corpus seems to build properly despite the errors. Perhaps some special characters don't display properly because of the encoding issue; I was not able to check this.
Input files: 4 files that have no empty or null JSON fields. In all cases, text_field contains text and the title_rss field was removed. Each of the files was converted from Western European (Windows) 1252 to Unicode UTF-8 (65001).
Errors: NONE!
Result: a properly formed corpus.
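The cleanup used for this second test could probably also be scripted in R rather than done by hand. The following is only a minimal sketch under several assumptions, not what was actually run here: it assumes each file holds a single JSON object, that the source encoding is Windows-1252 (written "CP1252" for iconv(), a name that can vary by platform), and that the jsonlite package is installed. The helper name clean_json_file() and its arguments are invented for illustration.

```r
library(jsonlite)

# Hypothetical helper: re-encode one JSON file to UTF-8, drop unwanted fields,
# and skip the file if its text field is missing, null, or empty.
# Returns TRUE if a cleaned file was written, FALSE if the file was skipped.
clean_json_file <- function(infile, outfile,
                            text_field = "maintext",
                            drop_fields = "title_rss",
                            from_encoding = "CP1252") {
  # Read the raw text and convert it to UTF-8
  txt <- paste(readLines(infile, warn = FALSE), collapse = "\n")
  txt <- iconv(txt, from = from_encoding, to = "UTF-8")

  # Parse into a named list; a JSON null becomes NULL
  doc <- parse_json(txt)

  # Remove the fields that are not needed (e.g. a null "title_rss")
  doc[drop_fields] <- NULL

  # Skip documents without usable text
  text <- doc[[text_field]]
  has_text <- is.character(text) && length(text) == 1 && nzchar(trimws(text))
  if (!has_text) return(FALSE)

  # Write the cleaned object back out as UTF-8
  writeLines(toJSON(doc, auto_unbox = TRUE, null = "null"), outfile, useBytes = TRUE)
  TRUE
}
```

Applied to each path in a directory listing, with the output written to a separate folder, this would leave the original files untouched while producing a set that contains only documents with text.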
Many thanks to the two developers for detailed feedback and useful leads. The assistance is deeply appreciated.
There are a few possibilities here, but the most likely are:
One of your files has a malformed JSON structure, from the point of view of readtext(). Even though this might be OK as far as the JSON format strictly goes, if one of your text fields is empty, for instance, then this will cause the error. (See below for a demonstration and a solution.)
While readtext() can take a "glob" pattern match, list.files() takes a regular expression. It's possible (but unlikely) that you are picking up something you don't want in list.files(pattern = "*.json"...). But list.files() should not be necessary with readtext(); see below.
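To make the glob-versus-regex point concrete, here is a small sketch using throwaway files in a temporary directory (the file names are invented for illustration). Base R's glob2rx() converts a glob into the equivalent regular expression, so either of the two calls below is a stricter way to list only files whose names end in .json.

```r
# Create a scratch directory with a few illustrative (made-up) file names
scratch <- file.path(tempdir(), "glob_vs_regex")
dir.create(scratch, showWarnings = FALSE)
invisible(file.create(file.path(scratch, c("a.json", "b.json.bak", "notes_json.txt"))))

# An explicit regular expression: a literal dot, then "json" at the end of the name
json_regex <- list.files(scratch, pattern = "\\.json$")

# The same intent written as a glob and translated to a regex by glob2rx()
json_glob <- list.files(scratch, pattern = glob2rx("*.json"))

print(glob2rx("*.json"))
print(json_regex)
print(json_glob)
```

Anchoring the pattern this way excludes names such as b.json.bak, which merely contain ".json" somewhere in the middle.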
To demonstrate, let's write out each document in data_corpus_inaugural as a separate JSON file, and then read them in using readtext().
library("quanteda", warn.conflicts = FALSE)
## Package version: 2.0.1
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
tmpdir <- tempdir()
corpdf <- convert(data_corpus_inaugural, to = "data.frame")
for (d in corpdf$doc_id) {
  cat(jsonlite::toJSON(dplyr::filter(corpdf, doc_id == d)),
    file = paste0(tmpdir, "/", d, ".json")
  )
}
head(list.files(tmpdir))
## [1] "1789-Washington.json" "1793-Washington.json" "1797-Adams.json"
## [4] "1801-Jefferson.json" "1805-Jefferson.json" "1809-Madison.json"
To read them in, you can use the "glob" pattern match here and just read the JSON files.
rt <- readtext::readtext(paste0(tmpdir, "/*.json"),
  text_field = "text", docid_field = "doc_id"
)
summary(corpus(rt), n = 5)
## Corpus consisting of 58 documents, showing 5 documents:
##
## Text Types Tokens Sentences Year President FirstName
## 1789-Washington.json 625 1537 23 1789 Washington George
## 1793-Washington.json 96 147 4 1793 Washington George
## 1797-Adams.json 826 2577 37 1797 Adams John
## 1801-Jefferson.json 717 1923 41 1801 Jefferson Thomas
## 1805-Jefferson.json 804 2380 45 1805 Jefferson Thomas
## Party
## none
## none
## Federalist
## Democratic-Republican
## Democratic-Republican
So that all worked fine.
But if we add to these one file whose text field is empty, then this produces the error in question:
cat('[ { "doc_id" : "d1", "text" : "this is a file" },
{ "doc_id" : "d2", "text" : } ]',
file = paste0(tmpdir, "/badfile.json")
)
rt <- readtext::readtext(paste0(tmpdir, "/*.json"),
  text_field = "text", docid_field = "doc_id"
)
## File doesn't contain a single valid JSON object.
## Error: This JSON file format is not supported.
True, that was not a valid JSON file, since it contained a tag with no value. But I suspect you have something like that in one of your files.
Here's how you can identify the problem: loop through your b (from the question, not as I've specified it below).
b <- tail(list.files(tmpdir, pattern = ".*\\.json", full.names = TRUE))
for (f in b) {
  cat("Reading:", f, "\n")
  rt <- readtext::readtext(f, text_field = "text", docid_field = "doc_id")
}
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2001-Bush.json
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2005-Bush.json
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2009-Obama.json
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2013-Obama.json
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2017-Trump.json
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/badfile.json
## File doesn't contain a single valid JSON object.
## Error: This JSON file format is not supported.
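The loop above stops at the first file that fails. As a variation, here is a minimal sketch that wraps each read in tryCatch() so that every file is attempted and the failures are collected. The helper name find_unreadable() is made up, and by default it only checks whether jsonlite::fromJSON() can parse each file, which is an assumption standing in for readtext; a function that calls readtext::readtext() with the relevant text_field can be passed as the reader argument instead.

```r
# Hypothetical helper: try to read each file on its own and return the paths
# of the files for which the reader raised an error.
find_unreadable <- function(files, reader = function(f) jsonlite::fromJSON(f)) {
  failed <- vapply(files, function(f) {
    tryCatch({
      reader(f)
      FALSE                         # the read succeeded
    }, error = function(e) TRUE)    # the read failed
  }, logical(1))
  unname(files[failed])
}

# Small self-contained demonstration with one good and one bad file
demo_dir <- file.path(tempdir(), "find_unreadable_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines('{ "doc_id" : "d1", "text" : "this is a file" }', file.path(demo_dir, "good.json"))
writeLines('{ "doc_id" : "d2", "text" : }', file.path(demo_dir, "bad.json"))

demo_files <- list.files(demo_dir, pattern = "\\.json$", full.names = TRUE)
bad <- find_unreadable(demo_files)
print(basename(bad))
```

With the question's setup, this might be called as find_unreadable(b, function(f) readtext::readtext(f, text_field = "maintext")), which would return every file that readtext cannot read rather than only the first one.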