
How to export DataFrame to_json in append mode - Python Pandas?

I have an existing JSON file formatted as a list of dicts.

$cat output.json
[{'a':1, 'b':2}, {'a':2, 'b':3}]

And I have a DataFrame

df = pd.DataFrame({'a': pd.Series([1, 2], index=list('CD')),
                   'b': pd.Series([3, 4], index=list('CD'))})

I want to save "df" with to_json, appending it to the file output.json:

df.to_json('output.json', orient='records')  #  mode='a' not available for to_json

* to_csv supports an append mode (mode='a'), but to_json does not.

The expected generated output.json file will be:

    [{'a':1, 'b':2}, {'a':2, 'b':3}, {'a':1, 'b':3}, {'a':2, 'b':4}]

The existing output.json file can be huge (terabytes, say); is it possible to append the new DataFrame's records without loading the whole file?

No, you can't append to a JSON file without rewriting the whole file using pandas or the json module. You might be able to modify the file "manually" by opening it in read/write mode, seeking to the correct position, and inserting your data. I wouldn't recommend this, though. Better to use a file format other than JSON if your file is going to be larger than your RAM.

This answer also might help. It doesn't create valid JSON files (instead, each line is a JSON string), but its goal is very similar to yours.

You could do this. It writes each record/row as JSON on its own line (the JSON Lines format).

with open(outfile_path, mode="a") as f:
    for chunk_df in data:  # data: an iterable of DataFrames
        f.write(chunk_df.to_json(orient="records", lines=True))

Maybe you need to think in terms of orient='records':

def to_json_append(df, file):
    '''Append df to file as JSON Lines.
    Load the result with pd.read_json(file, orient='records', lines=True).
    '''
    df.to_json('tmp.json', orient='records', lines=True)
    # append
    with open('tmp.json', 'r') as f:
        k = f.read()
    with open(file, 'a') as f:
        f.write('\n')  # separate from the previous entry
        f.write(k)

df = pd.read_json('output.json')
# Save again as lines
df.to_json('output.json', orient='records', lines=True)
# new data
df = pd.DataFrame({'a': pd.Series([1, 2], index=list('CD')),
                   'b': pd.Series([3, 4], index=list('CD'))})
# append:
to_json_append(df, 'output.json')

To load the full data:

pd.read_json('output.json',orient='records',lines=True)
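If the file is too large to load at once, pd.read_json also accepts a chunksize argument when lines=True, returning an iterator of DataFrames instead of one big frame. A minimal sketch (the small demo file built here stands in for a huge output.json):

```python
import pandas as pd

# Build a small JSON Lines file for the demo (stands in for a huge output.json)
pd.DataFrame({'a': range(5), 'b': range(5)}).to_json(
    'output.json', orient='records', lines=True)

# Read it lazily, two rows at a time, instead of loading everything at once
reader = pd.read_json('output.json', orient='records', lines=True, chunksize=2)
n_rows = 0
for chunk in reader:  # each chunk is an ordinary DataFrame
    n_rows += len(chunk)
```

Each chunk can be processed and discarded, so memory use stays bounded by the chunk size rather than the file size.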

I've solved it using only built-in pandas.DataFrame methods. Keep performance in mind in the case of huge DataFrames (there are ways to deal with it). Code:

if os.path.isfile(dir_to_json_file):
    # if the file exists, read it
    df_read = pd.read_json(dir_to_json_file, orient='index')
    # add the data that you want to save
    df_read = pd.concat([df_read, df_to_append], ignore_index=True)
    # in case you add too much unnecessary data (if you need it)
    df_read.drop_duplicates(inplace=True)

    # save it back to the JSON file
    df_read.to_json(dir_to_json_file, orient='index')
else:
    df_to_append.to_json(dir_to_json_file, orient='index')

Use case: write a large amount of data to a JSON file with a small memory footprint.

Let's say we have 1,000 DataFrames, each holding about 1,000,000 lines of JSON. If each DataFrame needs 100 MB, the total file size would be 1,000 * 100 MB = 100 GB.

Solution:

  1. use a buffer to store the content of each dataframe
  2. use pandas to dump it to text
  3. use append mode to write the text to the end of the file

import io
import pandas as pd
from pathlib_mate import Path

n_lines_per_df = 10
n_df = 3
columns = ["id", "value"]
value = "alice@example.com"
f = Path(__file__).change(new_basename="big-json-file.json")
if not f.exists():
    for nth_df in range(n_df):
        data = list()
        for nth_line in range(nth_df * n_lines_per_df, (nth_df + 1) * n_lines_per_df):
            data.append((nth_line, value))
        df = pd.DataFrame(data, columns=columns)
        buffer = io.StringIO()
        df.to_json(
            buffer,
            orient="records",
            lines=True,
        )
        with open(f.abspath, "a") as file:
            file.write(buffer.getvalue())
