
Write output of pandas.io.parsers.TextFileReader to pandas.DataFrame

I have a large CSV file which I am reading in chunks of a user-defined size "num_rows" (number of rows), using the "chunksize" argument, which returns a "pandas.io.parsers.TextFileReader" object as follows:

num_rows = int(input("Enter number of rows to be processed: "))

chunk = pd.read_csv("large_file.csv", chunksize=num_rows)

for data_chunk in chunk:
    # some processing
    # Finally, write the results back to a Pandas DataFrame
    data_chunk["new_column"] = some_precalculated_value

However, this approach clearly does not work. How do I go about writing the results of the chunks back to the original file, which in my case happens to be "large_file.csv", as a single Pandas DataFrame?

Thanks!

What you did will not modify the CSV, because each data_chunk is not linked to the original data.
You can instead write each data_chunk to a separate CSV file:

reader = pd.read_csv("large_file.csv", chunksize=num_rows)

for i, data_chunk in enumerate(reader):
    data_chunk["new_column"] = some_precalculated_value
    data_chunk.to_csv("large_file_part{}.csv".format(i))

To work with larger-than-memory data as a DataFrame, you can use dask. If you did the above, you should then only have to do:

import dask.dataframe as dd

ddf = dd.read_csv("large_file_part*.csv")
ddf.to_csv("large_file.csv", single_file=True)

Alternatively, you can load your DataFrame with dask from the start and perform computations with it.
Dask automatically splits your DataFrame into partitions and performs operations just as on a regular pandas DataFrame, but in a lazy fashion.
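
For example, here is a minimal sketch of that dask-first approach; the column name "new_column", the output file name, and the placeholder value are assumptions carried over from the question, not part of any fixed API:

import dask.dataframe as dd

# Lazily read the large CSV; dask splits it into partitions automatically
ddf = dd.read_csv("large_file.csv")

# Placeholder value standing in for the question's some_precalculated_value
some_precalculated_value = 42

# Column assignment is also lazy; nothing is computed yet
ddf["new_column"] = some_precalculated_value

# Trigger the computation and write everything back out as one CSV file
ddf.to_csv("large_file_with_results.csv", single_file=True)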
