Large ndjson file fails to load correctly in Python

Question

I have a 5 GB JSON file. I would like to load it and do some EDA on it to figure out where the relevant information is.

I tried:

import json
import pprint

json_fn = 'abc.ndjson'
data = json.load(open(json_fn, 'rb'))
pprint.pprint(data, depth=2)

but this just crashes with:

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

I also tried:

import ijson

with open(json_fn) as f:
    items = ijson.items(f, 'item', multiple_values=True)  # "multiple values" needed as it crashes otherwise with a "trailing garbage parse error" (https://stackoverflow.com/questions/59346164/ijson-fails-with-trailing-garbage-parse-error)
    print('Data loaded - no processing ...')
    print("---items---")
    print(items)
    for item in items:
        print("---item---")
        print(item)

But this just returns:

Data loaded, now importing
---items---
<_yajl2.items object at 0x7f436de97440>

Process finished with exit code 0

The ndjson file contains valid ASCII characters (as inspected with vi), but its lines are very long, so it is not really comprehensible in a text editor.

The file starts like this:

{"visitId":257057,"staticFeatures":[{"type":"CODES","value":"9910,51881,42833,486,4280,42731,2384,V5861,9847,3962,49320,3558,2720,4019,99092"},{"type":"visitID","value":"357057"},{"type":"VISITOR_ID","value":"68824"}, {"type":"ADMISSION_ID","value":"788457"},{"type":"AGE","value":"34"}, ...

What am I doing wrong, and how can I process this file?

Answer 1

You are using the prefix 'item'. For this to work, the JSON should have a list as its top-level element.

For example, see the JSON below.

data2.json

[
    {
      "Identifier": "21979c09fc4e6574"
    },
    {
      "Identifier": "e6235cce58ec8b9c"
    }
]

Code:

import ijson

# 'item' matches each element of the top-level list
with open('data2.json', 'rb') as fp:
    items = ijson.items(fp, 'item')
    for x in items:
        print(x)

Output:

{'Identifier': '21979c09fc4e6574'}
{'Identifier': 'e6235cce58ec8b9c'}

Another example

data.json

{
  "earth": {
    "europe": [
      {"name": "Paris", "type": "city", "info": {  }},
      {"name": "Thames", "type": "river", "info": {  }}
    ],
    "america": [
      {"name": "Texas", "type": "state", "info": {  }}
    ]
  }
}

The JSON above doesn't have a list as its top-level element, so a valid prefix must be passed to ijson.items(). Here the prefix should be 'earth.europe.item'.

Code:

import ijson

# 'earth.europe.item' matches each element of the list under earth -> europe
with open('data.json', 'rb') as fp:
    items = ijson.items(fp, 'earth.europe.item')
    for x in items:
        print(x)

Output:

{'name': 'Paris', 'type': 'city', 'info': {}}
{'name': 'Thames', 'type': 'river', 'info': {}}
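
Applied to the file in the question, the same reasoning suggests why nothing was printed: each top-level value there is an object, not a list, so 'item' never matches. What follows is only a sketch, not part of the original answer. It assumes the file really is newline-delimited (one complete JSON document per line), in which case the standard library can stream it without ijson. The function names iter_ndjson, iter_with_ijson and summarize_keys are made up for illustration. The ijson variant relies on the empty prefix '' selecting each top-level value when multiple_values=True is set, as I understand the ijson documentation; it is imported lazily so the rest runs without the library.

```python
import json
from collections import Counter
from itertools import islice


def iter_ndjson(path):
    """Yield one parsed record per non-empty line (standard library only)."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def iter_with_ijson(path):
    """Same idea with ijson: an empty prefix selects each top-level value."""
    import ijson  # third-party; imported lazily so the rest runs without it

    with open(path, "rb") as f:
        yield from ijson.items(f, "", multiple_values=True)


def summarize_keys(records, limit=1000):
    """Count top-level keys over the first `limit` records for a quick look."""
    counts = Counter()
    for record in islice(records, limit):
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts
```

For a first pass of EDA, something like print(summarize_keys(iter_ndjson('abc.ndjson'))) reads only the first 1000 records, so memory use stays bounded by the longest single line rather than the whole 5 GB file.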

Answer 1: accepted, 2020-08-18 10:48:50
