
Converting a large CSV file to multiple JSON files using Python

I am currently using the following code to convert a large CSV file to a JSON file.

import csv 
import json 

def csv_to_json(csvFilePath, jsonFilePath):
    jsonArray = []
      
    with open(csvFilePath, encoding='utf-8') as csvf: 
        csvReader = csv.DictReader(csvf) 

        for row in csvReader: 
            jsonArray.append(row)
    with open(jsonFilePath, 'w', encoding='utf-8') as jsonf: 
        jsonString = json.dumps(jsonArray, indent=4)
        jsonf.write(jsonString)
          
csvFilePath = r'test_data.csv'
jsonFilePath = r'test_data.json'
csv_to_json(csvFilePath, jsonFilePath)

This code works and converts the CSV to JSON without any issues. However, the CSV file contains 600,000+ rows, so the JSON file has just as many items and has become very difficult to manage.

I would like to modify the code above so that every 5,000 rows of the CSV are written to a new JSON file. Ideally, that would give me 120 (600,000/5,000) JSON files in this case.

How can I do this?

Split up your read and write logic and add a simple threshold:

import csv
import json

JSON_ENTRIES_THRESHOLD = 5000  # modify to whatever you see suitable

def write_json(json_array, filename):
    with open(filename, 'w', encoding='utf-8') as jsonf:
        json.dump(json_array, jsonf)  # note the usage of .dump directly to a file object

def csv_to_json(csvFilePath, jsonFilePath):
    # jsonFilePath is used as a prefix for the output names, e.g. 'test_data'
    jsonArray = []

    with open(csvFilePath, encoding='utf-8') as csvf:
        csvReader = csv.DictReader(csvf)
        filename_index = 0

        for row in csvReader:
            jsonArray.append(row)
            if len(jsonArray) >= JSON_ENTRIES_THRESHOLD:
                # if we reached the threshold, write out this batch and start a new one
                write_json(jsonArray, f"{jsonFilePath}-{filename_index}.json")
                filename_index += 1
                jsonArray = []

        # Finally, write out the remainder, if any (this avoids an empty file
        # when the row count is an exact multiple of the threshold)
        if jsonArray:
            write_json(jsonArray, f"{jsonFilePath}-{filename_index}.json")
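
The same threshold idea can also be written with itertools.islice, which pulls rows from the reader in fixed-size batches so there is no counter to reset and no separate "remainder" step. This is only a minimal sketch: the names chunked, split_csv_to_json and out_prefix are made up for illustration and do not come from the code above, and it assumes the CSV has a header row, as csv.DictReader expects by default.

```python
import csv
import json
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:  # iterator exhausted, so no empty trailing chunk is produced
            return
        yield chunk


def split_csv_to_json(csv_path, out_prefix, chunk_size=5000):
    """Write every `chunk_size` CSV rows to its own JSON file.

    `out_prefix` is a hypothetical parameter: files are named
    "<out_prefix>-0.json", "<out_prefix>-1.json", and so on.
    Returns the list of file names written.
    """
    written = []
    # newline='' is what the csv module documentation recommends for file objects
    with open(csv_path, encoding='utf-8', newline='') as csvf:
        reader = csv.DictReader(csvf)
        for index, rows in enumerate(chunked(reader, chunk_size)):
            filename = f"{out_prefix}-{index}.json"
            with open(filename, 'w', encoding='utf-8') as jsonf:
                json.dump(rows, jsonf, indent=4)
            written.append(filename)
    return written
```

Calling split_csv_to_json('test_data.csv', 'test_data') on a file with exactly 600,000 rows should write test_data-0.json through test_data-119.json. Only one batch of rows is held in memory at a time, and a row count that is an exact multiple of the chunk size does not leave an empty file behind.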

