
How to parse Wikidata JSON (.bz2) file using Python?

I want to look at entities and relationships using Wikidata. I downloaded the Wikidata JSON dump (from here; .bz2 file, size ~18 GB).

However, I cannot open the file; it is simply too big for my computer.

Is there a way to look into the file without extracting the full .bz2 file, ideally using Python? I know that there is a PHP dump reader (here), but I can't use it.

You can use the BZ2File interface to manipulate the compressed file, but you cannot load the whole thing with the json module; that would take too much memory. You will have to index the file: read it line by line, save the position and length of each interesting object in a dictionary (hash table), and then you can extract a given object and load it with the json module.
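A minimal sketch of that indexing approach, assuming the dump is named latest-all.json.bz2 and using the hypothetical entity ID "Q42"; only the line offsets are stored here, since readline() recovers each line. Note that BZ2File.seek() is emulated by re-decompressing from the start of the file, so random access is slow but memory-cheap.

import bz2
import json

index = {}  # entity ID -> offset of its line in the decompressed stream

with bz2.BZ2File("latest-all.json.bz2") as f:
    while True:
        offset = f.tell()                # position in the *decompressed* stream
        raw = f.readline()
        if not raw:
            break
        line = raw.decode().strip().rstrip(",")
        if line in ("[", "]") or not line:
            continue                     # skip the dump's array brackets
        index[json.loads(line)["id"]] = offset

# Later: pull out a single entity without keeping the whole dump in memory.
with bz2.BZ2File("latest-all.json.bz2") as f:
    f.seek(index["Q42"])                 # "Q42" is a hypothetical example ID
    entity = json.loads(f.readline().decode().strip().rstrip(","))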

I came up with a strategy that lets you use the json module to access the information without decompressing the whole file:

import bz2
import json

lines = []
with bz2.open(filename, "rt") as bzinput:
    for i, line in enumerate(bzinput):
        if i == 10:
            break                        # only look at the first 10 lines
        line = line.strip().rstrip(",")  # drop the trailing comma after each entity
        if line in ("[", "]"):
            continue                     # skip the JSON array brackets
        lines.append(json.loads(line))

In this way lines will be a list of dictionaries that you can easily manipulate, for example by removing keys you don't need in order to reduce their size.
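For instance, a hypothetical pruning step (the set of keys to keep is an assumption, not part of the dump format) could look like:

keep = {"id", "labels", "claims"}   # assumed keys of interest
lines = [{k: v for k, v in rec.items() if k in keep} for rec in lines]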

Note also that the condition i == 10 can be changed to fit your needs. For example, you may parse a few lines at a time, analyze them, and write to a text file the indices of the lines you actually want from the original file. It is then sufficient to read only those lines, using a similar condition on i in the for loop; see the sketch below.
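Here is a minimal two-pass sketch of that idea; the substring "Q42" used as the cheap first-pass filter is purely a hypothetical example, the point being that you only fully parse the lines you kept:

import bz2
import json

# Pass 1: cheap scan, record the indices of candidate lines.
indices = set()
with bz2.open(filename, "rt") as bzinput:
    for i, line in enumerate(bzinput):
        if "Q42" in line:                # hypothetical cheap filter
            indices.add(i)

# Pass 2: re-read the file and fully parse only those lines.
with bz2.open(filename, "rt") as bzinput:
    selected = [json.loads(l.strip().rstrip(","))
                for i, l in enumerate(bzinput) if i in indices]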

You'd have to do line-by-line processing:

import bz2
import json

path = "latest.json.bz2"

with bz2.BZ2File(path) as file:
    for line in file:
        line = line.decode().strip()

        if line in {"[", "]"}:
            continue  # skip the opening/closing brackets of the JSON array
        if line.endswith(","):
            line = line[:-1]  # drop the trailing comma after each entity
        entity = json.loads(line)

        # do your processing here
        print(str(entity)[:50] + "...")

Seeing as the Wikidata dump is now 70 GB+, you might wish to process it directly from the URL:

import bz2
import json
from urllib.request import urlopen

path = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2"

with urlopen(path) as stream:
    with bz2.BZ2File(stream) as file:  # decompress the HTTP response on the fly
        ...
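This works because BZ2File accepts any file-like object with a read() method, so the decompressor pulls bytes from the HTTP response on demand and nothing has to be written to disk first.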

