Convert non-nested json to csv file?

I am working with a non-nested JSON file containing data from Reddit, and I am trying to convert it to a CSV file using Python. The rows do not all have the same fields, and I keep getting this error:

JSONDecodeError: Extra data: line 2 column 1

Here is the code:

import csv
import json
import os

os.chdir('c:\\Users\\Desktop')
infile = open("data.json", "r")
outfile = open("outputfile.csv", "w")

writer = csv.writer(outfile)

for row in json.loads(infile.read()):
    writer.writerow(row)

Here are a few lines from the data:

{"author":"i_had_an_apostrophe","body":"\"It's not your fault.\"","author_flair_css_class":null,"link_id":"t3_5c0rn0","subreddit":"AskReddit","created_utc":1478736000,"subreddit_id":"t5_2qh1i","parent_id":"t1_d9t3q4d","author_flair_text":null,"id":"d9tlp0j"}
{"id":"d9tlp0k","author_flair_text":null,"parent_id":"t1_d9tame6","link_id":"t3_5c1efx","subreddit":"technology","created_utc":1478736000,"subreddit_id":"t5_2qh16","author":"willliam971","body":"9/11 inside job??","author_flair_css_class":null}
{"created_utc":1478736000,"subreddit_id":"t5_2qur2","link_id":"t3_5c44bz","subreddit":"excel","author":"excelevator","author_flair_css_class":"points","body":"Have you tried stepping through the code to analyse the values at each step?\n\n","author_flair_text":"442","id":"d9tlp0l","parent_id":"t3_5c44bz"}
{"created_utc":1478736000,"subreddit_id":"t5_2tycb","link_id":"t3_5c384j","subreddit":"OldSchoolCool","author":"10minutes_late","author_flair_css_class":null,"body":"**Thanks Hillary**","author_flair_text":null,"id":"d9tlp0m","parent_id":"t3_5c384j"}

I am thinking of collecting every field that appears in the data to use as the CSV header, and filling in NA wherever a row has no value for a field.

Your question is missing information about what you're trying to accomplish, so I'm guessing at some of the details. Note that CSV files don't use "nulls" to represent missing fields; they just have delimiters with nothing between them, like 1,2,,4,5, which has no third field value.

Also, how you open CSV files varies depending on whether you're using Python 2 or 3. The code below is for Python 3.
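As a small illustration of the point about empty fields, here is a sketch using only the standard library. It writes to an in-memory buffer (an assumption made to keep the example self-contained) and shows that csv.writer emits nothing between the delimiters for both None and an empty string.

```python
import csv
import io

# Write to an in-memory buffer instead of a real file (illustration only).
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")

# None and "" are both written as an empty field.
writer.writerow([1, 2, None, 4, 5])
writer.writerow([1, 2, "", 4, 5])

output = buffer.getvalue()
print(output)
```

Both rows come out with an empty third field, so a reader of the CSV cannot tell a missing value from an empty string unless you write a placeholder such as NA yourself.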

#!/usr/bin/env python3
import csv
import json
import os

os.chdir('c:\\Users\\Desktop')
# the file holds one JSON object per line, so parse it line by line;
# calling json.loads() on the whole file raises "Extra data"
with open('data.json', 'r') as infile:
    data = [json.loads(line) for line in infile if line.strip()]

# determine all the keys present, which will each become csv fields
# (sorted so the column order is the same on every run)
fields = sorted(set(key for row in data for key in row))

with open('outputfile.csv', 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fields)
    writer.writeheader()
    writer.writerows(data)

You can write a little function to build the rows for you, extracting data only where it is available and inserting None if it is not. What you called header, I called schema. Get all the fields, remove duplicates and sort, then build records based on the full set of fields and insert those records into the csv.

import csv
import json

def build_record(row, schema):
    values = []
    for field in schema:
        if field in row:
            values.append(row[field])
        else:
            values.append(None)
    return tuple(values)

# the input has one JSON object per line, so parse each line separately
with open("data.json", "r") as infile:
    rows = [json.loads(line) for line in infile if line.strip()]

# every key seen in any row, de-duplicated and sorted
schema = tuple(sorted({k for r in rows for k in r}))
records = [build_record(r, schema) for r in rows]

# Python 3: text mode with newline="" (on Python 2, open with "wb" instead)
with open("outputfile.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(schema)
    writer.writerows(records)
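Parsing line by line is also what avoids the error in the question. The sketch below uses a made-up two-line sample (not the asker's data) to show that json.loads on the whole text fails with "Extra data", while parsing each line separately succeeds.

```python
import json

# Hypothetical sample: two JSON objects, one per line, with different fields.
sample = '{"id": "a1", "body": "first"}\n{"id": "a2", "author": "someone"}\n'

# Parsing the whole text as one document fails after the first object.
try:
    json.loads(sample)
    whole_file_error = None
except json.JSONDecodeError as exc:
    whole_file_error = exc.msg

# Parsing each non-empty line on its own works.
rows = [json.loads(line) for line in sample.splitlines() if line.strip()]

print(whole_file_error)
print(rows)
```

The error therefore comes from the file layout (one object per line) rather than from the rows having different fields.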

You can use Pandas to fill in the blanks for you (you may need to pip install pandas first):

import pandas as pd
import os

os.chdir('c:\\Users\\Desktop')

# read the data into a Pandas DataFrame; lines=True tells Pandas that the
# file has one JSON object per line
with open("data.json", "r") as infile:
    df = pd.read_json(infile, lines=True)

# use Pandas to write to CSV
df.to_csv("myfile.csv")

I suggest using the csv.DictWriter class. That class needs a file to write to and a list of field names (which I worked out from your data sample).

import csv
import json
import os

fieldnames = [
    "author", "author_flair_css_class", "author_flair_text", "body",
    "created_utc", "id", "link_id", "parent_id", "subreddit",
    "subreddit_id"
]

os.chdir('c:\\Users\\Desktop')
with open("data.json", "r") as infile, \
        open("outputfile.csv", "w", newline="") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()

    # each line of the input file is a separate JSON object
    for row in infile:
        writer.writerow(json.loads(row))
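If you want NA for fields that a row lacks, as the question asks, csv.DictWriter accepts a restval argument that is written for any missing key. The following is a minimal sketch with made-up sample rows and an in-memory buffer; it collects the field names from the data instead of hard-coding them.

```python
import csv
import io
import json

# Hypothetical sample lines; the two rows do not share all fields.
lines = [
    '{"id": "a1", "body": "first"}',
    '{"id": "a2", "author": "someone"}',
]
rows = [json.loads(line) for line in lines]

# Collect every key that appears in any row, in a stable order.
fieldnames = sorted({key for row in rows for key in row})

# restval is the value written when a row has no entry for a field.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=fieldnames, restval="NA", lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)

result = buffer.getvalue()
print(result)
```

Note that restval only covers keys that are absent. A key that is present with a JSON null (such as author_flair_text in the sample data) is still written as an empty field, so you would have to replace those values yourself if you want NA there too.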
