简体   繁体   中英

Pandas: Continuously write from function to csv

I have a function set up for Pandas that runs through a large number of rows in input.csv and inputs the results into a Series. It then writes the Series to output.csv .

However, if the process is interrupted (for example by an unexpected event) the program will terminate and all data that would have gone into the csv is lost.

Is there a way to write the data continuously to the csv, regardless of whether the function finishes for all rows?

Prefarably, each time the program starts, a blank output.csv is created, that is appended to while the function is running.

import pandas as pd

df = pd.read_csv("read.csv")

def crawl(a):
    #Create x, y
    return pd.Series([x, y])

df[["Column X", "Column Y"]] = df["Column A"].apply(crawl)
df.to_csv("write.csv", index=False)

This is a possible solution that will append the data to a new file as it reads the csv in chunks. If the process is interrupted the new file will contain all the information up until the interruption.

import pandas as pd

#csv file to be read in 
in_csv = '/path/to/read/file.csv'

#csv to write data to 
out_csv = 'path/to/write/file.csv'

#get the number of lines of the csv file to be read
number_lines = sum(1 for row in (open(in_csv)))

#size of chunks of data to write to the csv
chunksize = 10

#start looping through data writing it to a new file for each chunk
for i in range(1,number_lines,chunksize):
     df = pd.read_csv(in_csv,
          header=None,
          nrows = chunksize,#number of rows to read at each loop
          skiprows = i)#skip rows that have been read

     df.to_csv(out_csv,
          index=False,
          header=False,
          mode='a',#append data to csv file
          chunksize=chunksize)#size of data to append for each loop

In the end, this is what I came up with. Thanks for helping out!

import pandas as pd

df1 = pd.read_csv("read.csv")

run = 0

def crawl(a):

    global run
    run = run + 1

    #Create x, y

    df2 = pd.DataFrame([[x, y]], columns=["X", "Y"])

    if run == 1:
        df2.to_csv("output.csv")
    if run != 1:
        df2.to_csv("output.csv", header=None, mode="a")

df1["Column A"].apply(crawl)

I would suggest this:

with open("write.csv","a") as f:
    df.to_csv(f,header=False,index=False)

The argument "a" will append the new df to an existing file and the file gets closed after the with block is finished, so you should keep all of your intermediary results.

I've found a solution to a similar problem by looping the dataframe with iterrows() and saving each row to the csv file, which in your case it could be something like this:

for ix, row in df.iterrows():
    row['Column A'] = crawl(row['Column A'])

    # if you wish to mantain the header
    if ix == 0:
        df.iloc[ix - 1: ix].to_csv('output.csv', mode='a', index=False, sep=',', encoding='utf-8')
    else:
        df.iloc[ix - 1: ix].to_csv('output.csv', mode='a', index=False, sep=',', encoding='utf-8', header=False)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM