How to read a large JSON file in Python to fetch certain values

I have a large JSON file in the form of a list of lists. It maps airport codes to their city, country, latitude, longitude and other values. Here is a sample of what it looks like:

[["Goroka", "Goroka", "Papua New Guinea", "GKA", "AYGA", "-6.081689", "145.391881", "5282", "10", "U", "Pacific/Port_Moresby"],
 ["Asaba Intl", "Asaba", "Nigeria", "ABB", "DNAS", "6.2033333", "6.6588889", "0", "1", "U", "Africa/Lagos"],
 ["Downtown Airpark", "Oklahoma", "United States", "DWN", "", "35.4491997", "-97.5330963", "3240", "-6", "U", "America/Chicago"],
 ["Mbeya", "Mbeya", "Tanzania", "MBI", "HTMB", "-8.9169998", "33.4669991", "4921", "3", "U", "Africa/Dar_es_Salaam"],
 ["Tazadit", "Zouerate", "Mauritania", "OUZ", "GQPZ", "22.7563992", "-12.4835997", "", "0", "U", "Africa/Nouakchott"],
 ["Wadi Al-Dawasir", "Wadi al-Dawasir", "Saudi Arabia", "WAE", "OEWD", "20.5042992", "45.1996002", "10007", "3", "U", "Asia/Riyadh"],
 ["Madang", "Madang", "Papua New Guinea", "MAG", "AYMD", "-5.207083", "145.7887", "20", "10", "U", "Pacific/Port_Moresby"],
 ["Mount Hagen", "Mount Hagen", "Papua New Guinea", "HGU", "AYMH", "-5.826789", "144.295861", "5388", "10", "U", "Pacific/Port_Moresby"],
 ["Nadzab", "Nadzab", "Papua New Guinea", "LAE", "AYNZ", "-6.569828", "146.726242", "239", "10", "U", "Pacific/Port_Moresby"],
 ["Port Moresby Jacksons Intl", "Port Moresby", "Papua New Guinea", "POM", "AYPY", "-9.443383", "147.22005", "146", "10", "U", "Pacific/Port_Moresby"]]

Each list has the form:

['name', 'city', 'country', 'iata', 'icao', 'lat', 'lon', 'alt', 'tz', 'dst', 'tzdb']

I am only concerned with the 'iata' and 'country' values of each list.

Given a string variable holding a particular IATA code, I want to read this JSON file, find the list in which that code appears, and fetch the corresponding 'country' value from it.

The file would contain most of the airport codes in the world, so while it is not tens of GB in size, it still holds a lot of lists.

This is how I currently read the JSON in Python:

import json

with open('airport_list.json','r') as airport_list:
    airport_dict = json.loads(airport_list.read())
    

The problem is that this loads the whole JSON into memory. I could try iterating over it with a JSON iterator that reads line by line, but then how would I look up the list that contains a given IATA code?

Is there a better, more efficient way to do this?

If the objective is to avoid loading the whole file into memory, it can be done in one of the following ways:

  1. Use ijson, which is "an iterative JSON parser with standard Python iterator interfaces."

  2. Load the JSON file into a document database and then query it. You could use TinyDB for that.

  3. Or you could read and process it in chunks, something like this:

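from functools import partial

def custom_operation(text):
    """
    Split text at the last '],' (the end of the last complete inner list).
    TODO: process the text before that point to find the names you need.
    Returns the matches and the text after the '],' as residual.
    """
    # If no '],' is present, rpartition leaves the whole text in residual.
    head, sep, residual = text.rpartition('],')
    matches = []  # TODO: fill in from head + sep
    return matches, residual

def readfile(filename):
    with open(filename, 'r') as fh:
        filepart = partial(fh.read, 1024*1024)
        # The file is opened in text mode, so the end-of-file sentinel is '' and not b''.
        iterator = iter(filepart, '')

        residual = ''
        for block in iterator:
            matches, residual = custom_operation(residual + block)
            yield matches
        # At this point residual still holds the final inner list, which has no trailing '],'.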

Hope that helps!
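For reference, the chunked idea in option 3 can also be completed with the standard library alone. The sketch below is not ijson and not the author's code: it swaps in `json.JSONDecoder.raw_decode` to pull one complete inner list at a time out of a small buffer. The names `iter_rows` and `find_country` are made up for illustration, and the field positions (country at index 2, iata at index 3) are assumed from the layout given in the question.

```python
import json

def iter_rows(fh, chunk_size=1024 * 1024):
    """Yield each inner list of a JSON array of lists, reading fh in chunks."""
    decoder = json.JSONDecoder()
    buffer = ''
    started = False  # becomes True once the outer '[' has been consumed
    while True:
        block = fh.read(chunk_size)
        buffer += block
        pos = 0
        while True:
            # raw_decode does not skip separators, so skip them by hand.
            while pos < len(buffer) and buffer[pos] in ' \t\r\n,':
                pos += 1
            if pos >= len(buffer):
                break
            if not started:
                if buffer[pos] != '[':
                    raise ValueError('expected a JSON array')
                started = True
                pos += 1
                continue
            if buffer[pos] == ']':
                return  # end of the outer array
            try:
                row, pos = decoder.raw_decode(buffer, pos)
            except json.JSONDecodeError:
                break  # the element is incomplete: read another block
            yield row
        buffer = buffer[pos:]  # keep only the unparsed residual
        if not block:
            break  # end of file

def find_country(fh, iata, chunk_size=1024 * 1024):
    """Return the country of the first row whose iata matches, else None."""
    for row in iter_rows(fh, chunk_size):
        # Assumed layout: ['name', 'city', 'country', 'iata', ...]
        if len(row) > 3 and row[3] == iata:
            return row[2]
    return None

# Possible usage, with the file name taken from the question:
# with open('airport_list.json', 'r', encoding='utf-8') as fh:
#     print(find_country(fh, 'LAE'))
```

Only one block plus one unfinished element is held in memory at a time, and the search stops as soon as the code is found. The sketch assumes every top-level element is itself a list, so a truncated element can never be mistaken for a complete one.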

To find the list in this JSON that contains a specific 'iata', you can iterate through it as a text file in fixed-size chunks, parsing each chunk to see whether it has what you need.

Unfortunately, if the 'iata' occurs near the end of the file, you'll still have to read your way through the whole file, although it won't all be in memory at once.

If this is a lookup that you need to do many times, it would probably be worth generating a dict with the iatas as keys and the countries as values. Because dictionaries are hash tables, this sort of lookup is very efficient, and you'd significantly decrease the file size by keeping only the two elements iata and country.
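A minimal sketch of that dict idea, assuming the field order given in the question (country at index 2, iata at index 3). The function names and the output file name are hypothetical, and rows with an empty iata are skipped because they cannot serve as keys for this lookup.

```python
import json

# Field positions follow the layout given in the question:
# ['name', 'city', 'country', 'iata', 'icao', ...]
COUNTRY, IATA = 2, 3

def build_iata_index(rows):
    """Map each non-empty IATA code to its country."""
    return {row[IATA]: row[COUNTRY] for row in rows if row[IATA]}

def save_index(index, path):
    """Write the reduced mapping to a much smaller JSON file."""
    with open(path, 'w', encoding='utf-8') as out:
        json.dump(index, out)

def load_index(path):
    """Read the reduced mapping back as a plain dict."""
    with open(path, 'r', encoding='utf-8') as src:
        return json.load(src)

# Possible one-off conversion, then cheap lookups afterwards:
# with open('airport_list.json', 'r', encoding='utf-8') as fh:
#     save_index(build_iata_index(json.load(fh)), 'iata_to_country.json')
# countries = load_index('iata_to_country.json')
# countries.get('LAE')
```

The conversion still loads the full file once, but every later run only needs the reduced file, and each lookup is a single dictionary access.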

Nevertheless, if I haven't dissuaded you from this course, here are functions that parse this JSON as a text file in chunks and return the country for a given iata, assuming that iatas are unique.

def read_in_chunks(file_object, chunk_size):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

def parse_chunk(chunk, iata):
    # Return the country that directly precedes the quoted iata, or None if
    # the chunk does not contain both values in full.
    # Assumes that no field contains a comma.
    pieces = [x.strip() for x in chunk.split(',')]
    token = f'"{iata}"'
    if token not in pieces:
        return None
    position = pieces.index(token)
    if position == 0:
        return None  # the country was cut off before this chunk started
    country = pieces[position - 1]
    if len(country) >= 2 and country.startswith('"') and country.endswith('"'):
        return country[1:-1]
    return None  # only a fragment of the country is present


def country_from_iata(iata, filename='example.json', chunk_size=64):
    # Search each chunk together with the one before it, so that a
    # country/iata pair split across a chunk boundary is still seen whole.
    # Assumes the pair is no longer than chunk_size characters.
    previous = ''
    with open(filename, 'rt') as f:
        for chunk in read_in_chunks(f, chunk_size):
            parsed = parse_chunk(previous + chunk, iata)
            if parsed is not None:
                return parsed
            previous = chunk
    return None

country_from_iata("LAE") # 'Papua New Guinea'

I would personally recommend the pandas library for this kind of task. It has a built-in function for reading JSON (read_json) and tends to be more efficient than the standard library's JSON offerings. Moreover, you can customize it pretty heavily for your exact use case. Note that its chunksize option only applies to line-delimited JSON (lines=True), so a single large array like this one is still read into memory in one go.

Here is a reference to the pandas read_json function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html
