
Python: How to read from a compressed JSON .gz file and write to a JSON file

I want to read from a compressed .json.gz file and write its decompressed contents to a .json file.

.json.gz files:

  • data/sample1.gz
  • data/sample2.gz

Write to .json files:

  • data/sample1.json
  • data/sample2.json

I had a requirement where I had a list of gzip-compressed JSON (.gz) files. I needed to decompress each one and write it back out as a JSON file with the same base name. The code below does this.

Place this script in the folder containing the .gz files and run it with python3.

file: script.py

import gzip
import os

def get_file_names_by_extension(path=".", file_extension=".gz"):
    # Return the names of all entries in `path` that end with `file_extension`.
    return [x for x in os.listdir(path) if x.endswith(file_extension)]

def write_file(data, destination_path, file_name, encoding="utf-8"):
    # Write the decoded text to destination_path/file_name.
    output_file_name = os.path.join(destination_path, file_name)
    print(output_file_name)
    # A text-mode file expects str, so pass the encoding to open()
    # instead of encoding the data to bytes first.
    with open(output_file_name, "w", encoding=encoding) as outfile:
        outfile.write(data)

def decompress_files(files, destination_path, output_format=".json", encoding="utf-8"):
    for file in files:
        # The context manager closes the gzip file even if reading fails.
        with gzip.open(file, "rb") as _file:
            content = _file.read().decode(encoding)
        # Keep the name up to the first dot, so both sample1.gz and
        # sample1.json.gz become sample1.json.
        output_file_name = "".join([file.split(".")[0], output_format])
        write_file(content, destination_path, output_file_name, encoding)


files = get_file_names_by_extension(path=".", file_extension=".gz")
decompress_files(files, ".", ".json")
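The script above reads each file fully into memory before writing it. For large files, a streaming variant may be preferable. The following is a minimal sketch, not part of the original answer: the helper names decompress_streaming and decompress_directory are hypothetical, and it assumes the decompressed bytes can be written out unchanged (no re-encoding).

```python
import gzip
import shutil
from pathlib import Path


def decompress_streaming(source, destination):
    """Copy the decompressed bytes of `source` into `destination` in chunks."""
    with gzip.open(source, "rb") as compressed, open(destination, "wb") as plain:
        # copyfileobj reads and writes in fixed-size chunks, so the whole
        # file never has to fit in memory at once.
        shutil.copyfileobj(compressed, plain)


def decompress_directory(directory=".", output_format=".json"):
    """Decompress every .gz file in `directory`; return the paths written."""
    written = []
    for source in sorted(Path(directory).glob("*.gz")):
        # Same naming rule as the script above: keep the part before the first dot.
        destination = source.with_name(source.name.split(".")[0] + output_format)
        decompress_streaming(source, destination)
        written.append(destination)
    return written
```

Calling decompress_directory("data") would write data/sample1.json and data/sample2.json next to the compressed files, assuming that folder layout.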

PySpark can infer from the file name that the JSON files are gzipped. You can read the data and then write it back without any compression. The benefit of doing this in Spark is that it can use multiple workers to read and write the data in parallel, especially if the data is in S3. Note that Spark writes its output as part files rather than preserving the original file names.

df = spark.read.json("data/")
df.write.json("data/", mode="append", compression="none")
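Without a Spark cluster, a similar parallel pass can be approximated on a single machine with the standard library. The sketch below is not Spark and is not from the original answer: decompress_one and decompress_parallel are hypothetical helper names, and the thread pool size is an arbitrary assumption. Threads are likely to help mainly when disk or network I/O dominates.

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def decompress_one(source):
    """Decompress one .gz file next to itself and return the output path."""
    source = Path(source)
    # Assumed naming rule: the part before the first dot, plus ".json".
    destination = source.with_name(source.name.split(".")[0] + ".json")
    with gzip.open(source, "rb") as compressed, open(destination, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
    return destination


def decompress_parallel(directory="data", max_workers=4):
    """Decompress all .gz files in `directory` using a small thread pool."""
    sources = sorted(Path(directory).glob("*.gz"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with `sources`.
        return list(pool.map(decompress_one, sources))
```

Unlike the Spark version, this keeps one output file per input file, so the original base names are preserved.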
