
How to parse Wikidata JSON (.bz2) file using Python?

I want to look at entities and relationships using Wikidata. I downloaded the Wikidata JSON dump (from here, .bz2 file, size ~18 GB).

However, I cannot open the file; it's just too big for my computer.

Is there a way to look into the file without extracting the full .bz2 file, especially using Python? I know that there is a PHP dump reader (here), but I can't use it.

You can use the BZ2File interface to manipulate the compressed file, but you cannot just load the whole thing with the json module; it would take too much space. You would have to index the file, meaning you read it line by line and save the position and length of each interesting object in a dictionary (hashtable); then you can extract a given object and load it with the json module, as in the sketch below.
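A minimal sketch of that indexing idea (the file name latest-all.json.bz2 and the lookup id Q42 are just examples; note that seeking in a .bz2 stream is emulated by re-decompressing, so it is slow, and the index itself gets large for the full dump):

import bz2
import json

index = {}  # entity id -> (offset, length) in the *decompressed* stream

with bz2.BZ2File("latest-all.json.bz2") as f:
    while True:
        offset = f.tell()
        raw = f.readline()
        if not raw:
            break
        line = raw.strip().rstrip(b",")
        if line in (b"", b"[", b"]"):  # skip the enclosing array brackets
            continue
        index[json.loads(line)["id"]] = (offset, len(raw))

# later: jump straight to one entity instead of rescanning everything
with bz2.BZ2File("latest-all.json.bz2") as f:
    offset, length = index["Q42"]  # "Q42" is just an example id
    f.seek(offset)                 # emulated seek: can be slow
    entity = json.loads(f.read(length).strip().rstrip(b","))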

I came up with a strategy that lets you use the json module to access the information without extracting the full file:

import bz2
import json

with bz2.open(filename, "rt") as bzinput:  # filename: path to the .bz2 dump
    lines = []
    for i, line in enumerate(bzinput):
        if i == 10:  # read only the first 10 lines
            break
        # the Wikidata dump is one big JSON array: skip the brackets
        # and strip each entity line's trailing comma before parsing
        line = line.strip().rstrip(",")
        if line in ("[", "]"):
            continue
        entity = json.loads(line)
        lines.append(entity)

In this way lines will be a list of dictionaries that you can easily manipulate and, for example, shrink by removing the keys that you don't need.
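For instance (sitelinks and descriptions are two of the bulkier keys in a Wikidata entity; drop whichever ones you don't need):

for d in lines:
    d.pop("sitelinks", None)     # per-wiki page links: usually the bulkiest key
    d.pop("descriptions", None)  # multilingual descriptions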

Note also that (obviously) the condition i == 10 can be changed arbitrarily to fit your needs. For example, you may parse a few lines at a time, analyze them, and write to a txt file the indices of the lines you really want from the original file. Then it will be sufficient to read only those lines (using a similar condition on i in the for loop); a sketch of this two-pass approach follows.
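A minimal sketch of that two-pass idea, assuming a hypothetical filter is_interesting and the example file name latest-all.json.bz2:

import bz2
import json

def is_interesting(entity):
    # hypothetical filter: keep items whose English label mentions "Python"
    label = entity.get("labels", {}).get("en", {}).get("value", "")
    return "Python" in label

# first pass: write the line numbers of interesting entities to a txt file
with bz2.open("latest-all.json.bz2", "rt") as bzinput, \
        open("indices.txt", "w") as out:
    for i, line in enumerate(bzinput):
        line = line.strip().rstrip(",")
        if line in ("[", "]"):
            continue
        if is_interesting(json.loads(line)):
            out.write(f"{i}\n")

# second pass: re-read the dump, parsing only the recorded lines
with open("indices.txt") as f:
    wanted = {int(n) for n in f}

with bz2.open("latest-all.json.bz2", "rt") as bzinput:
    entities = [json.loads(line.strip().rstrip(","))
                for i, line in enumerate(bzinput) if i in wanted]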

You'd have to do line-by-line processing:

import bz2
import json

path = "latest.json.bz2"

with bz2.BZ2File(path) as file:
    for line in file:
        line = line.decode().strip()

        # the dump is one huge JSON array: skip the enclosing brackets
        if line in {"[", "]"}:
            continue
        # each entity line ends with a comma; drop it before parsing
        if line.endswith(","):
            line = line[:-1]
        entity = json.loads(line)

        # do your processing here
        print(str(entity)[:50] + "...")
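For example, the processing step could pull out the English label and the "instance of" (P31) claims. This is only a sketch, assuming the standard Wikidata entity layout (id, labels, claims); swap in whatever fields you need:

def describe(entity):
    # entity follows the standard Wikidata layout: id, labels, claims
    label = entity.get("labels", {}).get("en", {}).get("value", "")
    instance_of = [
        claim["mainsnak"]["datavalue"]["value"]["id"]  # e.g. "Q5" (human)
        for claim in entity.get("claims", {}).get("P31", [])
        if claim["mainsnak"].get("datavalue")  # skip "somevalue"/"novalue" snaks
    ]
    print(entity["id"], label, instance_of)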

Seeing as Wikidata is now 70 GB+, you might wish to process it directly from the URL:

import bz2
import json
from urllib.request import urlopen

path = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2"

with urlopen(path) as stream:
    with bz2.BZ2File(stream) as file:  # decompress the HTTP response on the fly
        ...
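Streaming like this avoids having to store the 70 GB+ archive on disk first, at the cost that an interrupted connection means restarting the download from the beginning.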
