How to read a large JSON file in Python to get certain values

Question

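I have a large JSON file structured as a list of lists. It maps airport codes to their city, country, latitude, longitude, and other values. Here is a sample of what it looks like: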

[["Goroka", "Goroka", "Papua New Guinea", "GKA", "AYGA", "-6.081689", "145.391881", "5282", "10", "U", "Pacific/Port_Moresby"], ["Asaba Intl", "Asaba", "Nigeria", "ABB", "DNAS", "6.2033333", "6.6588889", "0", "1", "U", "Africa/Lagos"], ["Downtown Airpark", "Oklahoma", "United States", "DWN", "", "35.4491997", "-97.5330963", "3240", "-6", "U", "America/Chicago"], ["Mbeya", "Mbeya", "Tanzania", "MBI", "HTMB", "-8.9169998", "33.4669991", "4921", "3", "U", "Africa/Dar_es_Salaam"], ["Tazadit", "Zouerate", "Mauritania", "OUZ", "GQPZ", "22.7563992", "-12.4835997", "", "0", "U", "Africa/Nouakchott"], ["Wadi Al-Dawasir", "Wadi al-Dawasir", "Saudi Arabia", "WAE", "OEWD", "20.5042992", "45.1996002", "10007", "3", "U", "Asia/Riyadh"], ["Madang", "Madang", "Papua New Guinea", "MAG", "AYMD", "-5.207083", "145.7887", "20", "10", "U", "Pacific/Port_Moresby"], ["Mount Hagen", "Mount Hagen", "Papua New Guinea", "HGU", "AYMH", "-5.826789", "144.295861", "5388", "10", "U", "Pacific/Port_Moresby"], ["Nadzab", "Nadzab", "Papua New Guinea", "LAE", "AYNZ", "-6.569828", "146.726242", "239", "10", "U", "Pacific/Port_Moresby"], ["Port Moresby Jacksons Intl", "Port Moresby", "Papua New Guinea", "POM", "AYPY", "-9.443383", "147.22005", "146", "10", "U", "Pacific/Port_Moresby"]

Each inner list has the format:

['name', 'city', 'country', 'iata', 'icao', 'lat', 'lon', 'alt', 'tz', 'dst', 'tzdb']

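I care about the 'iata' and 'country' values in each list.

The code is given a string variable holding a specific iata code. I then want to read this JSON file, find the list in which that iata code appears, and get the corresponding 'country' value from it.

The file will contain most of the world's airport codes, so while it is not 10 GB, it still has a lot of lists.

This is how I currently read JSON in Python: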

import json

with open('airport_list.json','r') as airport_list:
    airport_dict = json.loads(airport_list.read())

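The problem is that this loads the entire JSON into memory. I could try iterating over it with a JSON iterator that reads it line by line, but then how would I look up the string variable holding the iata code within a specific list in the JSON?

Is there a better, more efficient way to do this?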

Answer 1

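If the goal is to avoid loading the whole file into memory, you can do it in one of the following ways:

Use ijson, which is "an iterative JSON parser with standard Python iterator interfaces."
Dump the JSON file into a document database and read from that. You can use TinyDB for this.
Or you can read and process it in chunks, like this: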

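from functools import partial

def custom_operation(text):
    """
    Split text at the last '],'. Everything before it consists of complete
    records and can be searched for matches; everything after it is returned
    as residual so it can be prepended to the next block.
    """
    head, sep, residual = text.rpartition('],')
    if not sep:
        # No complete record in this text yet: carry all of it forward.
        return [], text
    matches = []  # TODO: extract the wanted values from head
    return matches, residual

def readfile(filename, block_size=1024 * 1024):
    with open(filename, 'r') as fh:
        filepart = partial(fh.read, block_size)
        # The file is opened in text mode, so the end-of-file sentinel is '' (not b'').
        iterator = iter(filepart, '')

        residual = ''
        for block in iterator:
            matches, residual = custom_operation(residual + block)
            yield matches
        # TODO: the last record ends with ']]', not '],', so it is still in residual here.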
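
One way to fill in the chunked approach using only the standard library is json.JSONDecoder.raw_decode, which parses a single value from the start of a string and reports where it stopped. The sketch below is not the ijson API. The names iter_records and country_for_iata are hypothetical, and it assumes the file is one top-level array whose elements are lists in the field order shown in the question.

```python
import json

def iter_records(path, chunk_size=64 * 1024):
    """Yield each inner list of a top-level JSON array, reading path in chunks."""
    decoder = json.JSONDecoder()
    with open(path, 'r', encoding='utf-8') as fh:
        buffer = ''
        started = False  # becomes True once the outer '[' has been consumed
        eof = False
        while not eof:
            block = fh.read(chunk_size)
            eof = not block
            buffer += block
            while True:
                # Drop whitespace and the commas that separate records.
                buffer = buffer.lstrip(' \t\r\n,')
                if not started:
                    if not buffer:
                        break
                    if buffer[0] != '[':
                        raise ValueError('expected a top-level JSON array')
                    buffer = buffer[1:]
                    started = True
                    continue
                if not buffer or buffer[0] == ']':
                    break  # need more data, or the outer array has ended
                try:
                    record, end = decoder.raw_decode(buffer)
                except json.JSONDecodeError:
                    break  # the record is incomplete; read another block
                yield record
                buffer = buffer[end:]

def country_for_iata(path, iata, chunk_size=64 * 1024):
    """Return the country (field 2) of the first record whose iata (field 3) matches."""
    for record in iter_records(path, chunk_size):
        if record[3] == iata:
            return record[2]
    return None
```

Only the current block plus one partial record is held in memory at a time. A malformed record is not reported as an error here: the buffer simply keeps growing until the end of the file, so treat this as a starting point rather than a hardened parser.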

Hope that helps!

Answer 2

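To find the list containing a specific 'iata' in this JSON, you can iterate over it as a text file in chunks of bytes, parsing each chunk to see whether it has what you need.

Unfortunately, if the 'iata' appears near the end of the list, you will still have to read the whole file, although it won't all be in memory at once.

If this is a lookup you need to perform many times, it may be worth generating a dict with iatas as keys and countries as values. Because dictionaries are hash tables, this kind of lookup is very efficient, and keeping only the two elements iata and country can significantly reduce the file size.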
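
The dict idea could look roughly like the sketch below. The function names build_iata_index and save_index are hypothetical, and it assumes the field order from the question (country at index 2, iata at index 3). Note that building the index still loads the full file once; the saving comes on every lookup after that.

```python
import json

def build_iata_index(path):
    """Load the airport list once and map each iata code to its country."""
    with open(path, 'r', encoding='utf-8') as fh:
        airports = json.load(fh)
    # Skip rows with an empty iata; if a code repeats, the last row wins.
    return {row[3]: row[2] for row in airports if row[3]}

def save_index(index, path):
    """Write the much smaller iata -> country mapping for later reuse."""
    with open(path, 'w', encoding='utf-8') as fh:
        json.dump(index, fh)
```

After that, a lookup is a plain dictionary access such as index.get("LAE"), which returns None when the code is unknown.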

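Nonetheless, if I haven't dissuaded you from this course of action, here are some functions that parse this JSON as a text file in chunks and return the country for a given iata, assuming iatas are unique.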

def read_in_chunks(file_object, chunk_size):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

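def parse_chunk(chunk, iata):
    # Returns the country, "fragment" if the field before the iata was cut
    # off by the chunk boundary, or None if the quoted iata is not a
    # complete field in this chunk.
    quoted = f'"{iata}"'
    pieces = [x.strip() for x in chunk.split(',')]
    if quoted not in pieces:
        return None
    idx = pieces.index(quoted)
    # The country is the field immediately before the iata.
    if idx > 0 and pieces[idx - 1].startswith('"'):
        return pieces[idx - 1].replace('"', '')
    return "fragment"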


def country_from_iata(iata):    
    count = 0
    
    # attempt to find the element immediately prior to the iata
    with open('example.json', 'rt') as f:
        for chunk in read_in_chunks(f, 64):
            parsed = parse_chunk(chunk, iata)
            if parsed:
                break
            count += 64
    
    # if the element was split, then shift half an iteration to the left.
    if parsed == "fragment":
        with open('example.json', 'rt') as f:
            f.seek(count-32)
            for chunk in read_in_chunks(f, 64):
                parsed = parse_chunk(chunk, iata)
                if parsed:
                    break
    
    return parsed

country_from_iata("LAE") # 'Papua New Guinea'

Answer 3

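Personally, I would recommend the pandas library for this kind of task. It has a built-in function for reading JSON (read_json) and tends to be more efficient than the standard library's json module. You can also customize it heavily for your exact use case.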

Here is a reference to the Pandas read_json function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html .
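
A minimal sketch of that suggestion follows. The function name country_from_iata_pandas is hypothetical, the column names are taken from the format given in the question, and dtype=False is passed so that values such as latitudes stay as strings. Keep in mind that read_json called this way still loads the whole file into memory.

```python
import pandas as pd

COLUMNS = ['name', 'city', 'country', 'iata', 'icao',
           'lat', 'lon', 'alt', 'tz', 'dst', 'tzdb']

def country_from_iata_pandas(path, iata):
    """Read the list-of-lists JSON into a DataFrame and look up one iata code."""
    df = pd.read_json(path, dtype=False)
    df.columns = COLUMNS  # the file has no header, so name the columns here
    matches = df.loc[df['iata'] == iata, 'country']
    return matches.iloc[0] if not matches.empty else None
```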
