How to read a JSON file in Python as a stream, in chunks with a specific format

I have a huge JSON file (~8 GB) and I want to read it as a stream, in chunks of 1000 examples at a time. I searched a lot and tried several packages, but none of them really did the job.

The format of my file is as follows:

{
    "Elem1": [
        {
            "orgs": []
        },
        {
            "people": []
        }
    ],
    "Elem2": [
        {
            "orgs": []
        },
        {
            "people": []
        }
    ],
...
}

As you can see, each element is stored as a list (a JSON array) of two dicts with recurring keys. Is there a way to read/load/process the file above in chunks of elements, i.e. chunk_1 = [ Elem1, Elem2, ... ], into RAM and get the values for the keys out of them? Any ideas how to do that? I would appreciate your help.

Best regards, Chris

As Serge said, you will need a custom parser to do the job. Something like the following:

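DESIRED_COUNT = 1000   # number of top-level elements to read
stack = []             # currently open '{' and '[' brackets
json_string = ""
count = 0
in_string = escaped = False
with open(filename) as f:
  while count < DESIRED_COUNT:
    c = f.read(1)
    if not c:                    # end of file
      break
    json_string += c
    if in_string:                # brackets inside strings must not be counted
      if escaped:
        escaped = False
      elif c == '\\':
        escaped = True
      elif c == '"':
        in_string = False
    elif c == '"':
      in_string = True
    elif c == '{' or c == '[':
      stack.append(c)
    elif c == '}' or c == ']':
      stack.pop()
      if len(stack) == 1:        # back at depth 1: one element is complete
        count += 1
if stack:                        # stopped early, so close the outer object
  json_string += '}'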

The final json_string will contain the JSON with DESIRED_COUNT objects.
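
To walk through the whole file in successive chunks rather than stopping after the first DESIRED_COUNT elements, one alternative is to let the standard library's json.JSONDecoder.raw_decode parse one key and one value at a time from a buffer that is refilled from the file. The following is only a minimal sketch under a few assumptions: the top level is a single JSON object as in the question, each element is small compared to available RAM, and the names iter_top_level_items, iter_chunks and block_size are made up for illustration. Note that a value spanning many blocks is re-parsed after each refill, so this is not tuned for very large individual elements.

```python
import io
import json
from itertools import islice


def iter_top_level_items(f, block_size=65536):
    """Yield (key, value) pairs of the top-level JSON object in f, one at a time."""
    decoder = json.JSONDecoder()
    buf = ""
    eof = False

    def fill():
        # Append the next block of text to the buffer, or record end of file.
        nonlocal buf, eof
        data = f.read(block_size)
        if data:
            buf += data
        else:
            eof = True

    def skip_ws():
        # Drop leading whitespace, reading more input if the buffer runs dry.
        nonlocal buf
        while True:
            buf = buf.lstrip()
            if buf or eof:
                return
            fill()

    def decode():
        # Decode one JSON value from the front of buf; read more if it is incomplete.
        nonlocal buf
        while True:
            try:
                value, end = decoder.raw_decode(buf)
                # Accept only if more text follows (or at EOF), so a value cut
                # off exactly at a block boundary is never accepted too early.
                if end < len(buf) or eof:
                    buf = buf[end:]
                    return value
            except json.JSONDecodeError:
                if eof:
                    raise
            fill()

    skip_ws()
    if not buf.startswith("{"):
        raise ValueError("expected a JSON object at the top level")
    buf = buf[1:]
    while True:
        skip_ws()
        if buf.startswith("}"):
            return
        if buf.startswith(","):
            buf = buf[1:]
            continue
        key = decode()
        skip_ws()
        if not buf.startswith(":"):
            raise ValueError("expected ':' after key")
        buf = buf[1:]
        skip_ws()
        yield key, decode()


def iter_chunks(f, chunk_size=1000, block_size=65536):
    """Yield lists of up to chunk_size (key, value) pairs."""
    items = iter_top_level_items(f, block_size)
    while True:
        chunk = list(islice(items, chunk_size))
        if not chunk:
            return
        yield chunk


# Small in-memory stand-in for the 8 GB file
sample = io.StringIO('{"Elem1": [{"orgs": []}, {"people": []}], '
                     '"Elem2": [{"orgs": ["x"]}, {"people": ["y"]}]}')
for chunk in iter_chunks(sample, chunk_size=1):
    for name, (orgs, people) in chunk:
        print(name, orgs["orgs"], people["people"])
```

With a real file, the same loop would run over the object returned by open(filename), and only one chunk of elements is held in memory at a time.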


 