
How to read big JSON?

I receive JSON files with data to be analyzed in R, for which I use the RJSONIO package:

library(RJSONIO)
filename <- "Indata.json"
jFile <- fromJSON(filename)

When the JSON files are larger than about 300 MB (uncompressed), my computer starts to use swap memory and the parsing (fromJSON) goes on for hours. A 200 MB file takes only about one minute to parse.

I use R 2.14 (64-bit) on 64-bit Ubuntu with 16 GB RAM, so I'm surprised that swapping is already needed at about 300 MB of JSON.

What can I do to read big JSON files? Is there something in the memory settings that messes things up? I have restarted R and run only the three lines above. The JSON files contain 2-3 columns with short strings and 10-20 columns with numbers from 0 to 1000000, i.e. it is the number of rows that makes the size large (more than a million rows in the parsed data).


Update: From the comments I learned that rjson does more of its work in C, so I tried it. A 300 MB file that with RJSONIO reached 100% memory use (from a 6% baseline, according to the Ubuntu System Monitor) and went on to swapping needed only 60% memory with the rjson package, and the parsing was done in reasonable time (minutes).
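
For reference, a minimal sketch of that swap, assuming the same file as in the code above (rjson takes the path via its file argument and uses its C parser by default):

library(rjson)

filename <- "Indata.json"
# parse the whole file with rjson's default C implementation
jFile <- fromJSON(file = filename)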

Although your question doesn't specify this detail, you may want to make sure that loading the entire JSON document into memory is actually what you want. It looks like RJSONIO is a DOM-based API.

What computation do you need to do? Can you use a streaming parser? An example of a SAX-like streaming parser for JSON is yajl.

Even though the question is very old, this might be of use to someone with a similar problem.

The function jsonlite::stream_in() lets you set pagesize to control the number of lines read at a time, and a custom function that is applied to each such subset can be supplied as handler. This makes it possible to work with very large JSON files without reading everything into memory at once.

library(jsonlite)
con <- file("Indata.json", open = "r")  # connection to an NDJSON source (one JSON record per line)
stream_in(con, pagesize = 5000, handler = function(x){
    # Do something with the current chunk of records here
})
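
As a hedged sketch of how the handler can be used for out-of-core aggregation (the aggregation itself is made up for illustration, not part of the original answer), per-chunk summaries can be accumulated instead of keeping all rows in memory:

library(jsonlite)

totals <- NULL
con <- file("Indata.json", open = "r")  # assumes NDJSON: one JSON record per line
stream_in(con, pagesize = 5000, handler = function(df) {
    num <- df[sapply(df, is.numeric)]            # keep only the numeric columns
    chunk_sums <- colSums(num, na.rm = TRUE)
    # accumulate running column sums across chunks
    totals <<- if (is.null(totals)) chunk_sums else totals + chunk_sums
})
totals  # column sums over the whole file, computed chunk by chunk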

Not on memory size but on speed: for the quite small iris dataset (only 7088 bytes), the RJSONIO package is an order of magnitude slower than rjson. Don't use method 'R' unless you really have to! Note the different units in the two sets of results.

library(rjson)  # swap for library(RJSONIO) to repeat the benchmark with the other package
library(plyr)
library(microbenchmark)
x <- toJSON(iris)
# CJ/JC: toJSON/fromJSON with defaults; RJ/JR: the same calls with method = 'R'
(op <- microbenchmark(CJ = toJSON(iris), RJ = toJSON(iris, method = 'R'),
  JC = fromJSON(x), JR = fromJSON(x, method = 'R')))

# for rjson on this machine...
Unit: microseconds
  expr        min          lq     median          uq        max
1   CJ    491.470    496.5215    501.467    537.6295    561.437
2   JC    242.079    249.8860    259.562    274.5550    325.885
3   JR 167673.237 170963.4895 171784.270 172132.7540 190310.582
4   RJ    912.666    925.3390    957.250   1014.2075   1153.494

# for RJSONIO on the same machine...
Unit: milliseconds
  expr      min       lq   median       uq      max
1   CJ 7.338376 7.467097 7.563563 7.639456 8.591748
2   JC 1.186369 1.234235 1.247235 1.265922 2.165260
3   JR 1.196690 1.238406 1.259552 1.278455 2.325789
4   RJ 7.353977 7.481313 7.586960 7.947347 9.364393
