
How to open a .ndjson file in Python?

I have a 20GB .ndjson file that I want to open with Python. The file is too big, so I found a way to split it into 50 pieces with an online tool. This is the tool: https://p.netools.com/split-files

Now I get one file that has the extension .ndjson.000 (and I do not know what that is).

I'm trying to open it as JSON or as a CSV file, to read it in pandas, but it does not work. Do you have any idea how to solve this?

import json
import pandas as pd

First approach:

df = pd.read_json('dump.ndjson.000', lines=True)

Error: ValueError: Unmatched ''"' when when decoding 'string'

Second approach:

with open('dump.ndjson.000', 'r') as f:
    my_data = f.read()

print(my_data)

Error: json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 104925061 (char 104925060)

I think the problem is that I have some emojis in my file, and I do not know how to encode them.

ndjson is now supported out of the box in pandas with the argument lines=True:

import pandas as pd

df = pd.read_json('/path/to/records.ndjson', lines=True)
# lines=True is only valid for writing together with orient='records'
df.to_json('/path/to/export.ndjson', orient='records', lines=True)
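For a file as large as the 20GB one in the question, one possible variation is to pass chunksize together with lines=True. In that case read_json returns an iterator of DataFrames instead of loading everything at once. The sketch below uses a small in-memory sample (made-up data) in place of a real file, and the chunk size of 2 is only for illustration:

```python
import io

import pandas as pd

# In-memory stand-in for a large .ndjson file (hypothetical sample data).
sample = io.StringIO(
    '{"id": 1, "text": "hello"}\n'
    '{"id": 2, "text": "world 🌍"}\n'
    '{"id": 3, "text": "ndjson"}\n'
)

# With lines=True and chunksize set, read_json yields one DataFrame per chunk.
reader = pd.read_json(sample, lines=True, chunksize=2)

row_counts = []
for chunk in reader:
    # Process each chunk here (filter, aggregate, write out, ...).
    row_counts.append(len(chunk))

print(row_counts)
```

With a real file you would pass the path instead of the StringIO object and pick a much larger chunk size.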

I think pandas.read_json cannot handle ndjson correctly.

According to this issue, you can do something like this to read it:

import ujson as json
import pandas as pd

# Parse one JSON record per line; the with-block closes the file afterwards
with open('/path/to/records.ndjson') as f:
    records = map(json.loads, f)
    df = pd.DataFrame.from_records(records)

PS: All credit for this code goes to KristianHolsheimer from the GitHub issue.
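If the file was cut by a byte-based splitter, a part may end in the middle of a record. That would produce "Unterminated string" errors whether or not the data contains emojis. This is an assumption, since how the tool splits files is not stated in the question. A minimal sketch using only the standard json module that keeps the parseable lines and reports the ones that fail (read_ndjson_tolerant is a hypothetical helper name, and the sample data is made up):

```python
import io
import json


def read_ndjson_tolerant(lines):
    """Parse one JSON record per line; return (records, bad_line_numbers)."""
    records, bad_lines = [], []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            # Most likely a record that was cut in half by the splitter.
            bad_lines.append(line_number)
    return records, bad_lines


# Hypothetical split part whose last record is truncated.
part = io.StringIO(
    '{"id": 1, "text": "ok 😀"}\n'
    '{"id": 2, "text": "also ok"}\n'
    '{"id": 3, "text": "cut off in the mid'
)

records, bad_lines = read_ndjson_tolerant(part)
print(len(records), bad_lines)
```

With a real part you would pass an open file object, for example open('dump.ndjson.000', encoding='utf-8'). The function collects all records in a list, so for very large parts it may be preferable to process each record as it is parsed.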

ndjson (newline-delimited JSON) is a JSON Lines format, that is, each line is a separate JSON document. It is ideal for a dataset lacking rigid structure ('non-sql') where the file size is large enough to warrant multiple files.

You can use pandas:

import pandas as pd
data = pd.read_json('dump.ndjson.000', lines=True)

In case your JSON strings do not contain newlines, you can alternatively use:

import json
with open("dump.ndjson.000") as f:
    # Iterate over the file directly instead of loading all lines first
    data = [json.loads(line) for line in f]
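On the newline caveat: a JSON serializer such as Python's json.dumps escapes a newline inside a string value as \n, so a record written that way still occupies exactly one line. A short round trip illustrating this, with made-up records and an in-memory buffer standing in for a file:

```python
import io
import json

# Hypothetical records: one contains a newline inside a string value.
records = [
    {"id": 1, "text": "first line\nsecond line"},
    {"id": 2, "text": "emoji 😀"},
]

# Write one JSON document per line.
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

serialized = buffer.getvalue()

# Read the records back line by line.
buffer.seek(0)
round_trip = [json.loads(line) for line in buffer]

print(serialized.count("\n"), round_trip == records)
```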

