Write output of pandas.io.parsers.TextFileReader to pandas.DataFrame

Question

I have a large CSV file that I read in chunks, with the chunk size taken from a user-defined input "num_rows" (number of rows). Passing the "chunksize" argument makes pd.read_csv return a "pandas.io.parsers.TextFileReader" object, as follows:

import pandas as pd

num_rows = int(input("Enter number of rows to be processed: "))

chunk = pd.read_csv("large_file.csv", chunksize=num_rows)

for data_chunk in chunk:
    # some processing
    # Finally, write back results to Pandas DataFrame
    data_chunk["new_column"] = some_precalculated_value

However, this approach clearly does not work. How do I write the results of the chunks back to the original data, which in my case is "large_file.csv"?

Thanks!

Answer 1

What you did will not modify the CSV because each data_chunk is an independent in-memory copy that is not linked to the original data.
You can write each data_chunk to a separate CSV file:

reader = pd.read_csv("large_file.csv", chunksize=num_rows)

for i, data_chunk in enumerate(reader):
    data_chunk["new_column"] = some_precalculated_value
    # index=False avoids an extra unnamed index column when the parts are read back
    data_chunk.to_csv("large_file_part{}.csv".format(i), index=False)

To work with larger-than-memory data as a dataframe, you can use dask. If you did the above, then you should just have to do:

import dask.dataframe as dd

ddf = dd.read_csv("large_file_part*.csv")
ddf.to_csv("large_file.csv", single_file=True)
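If you would rather stay in pandas, here is a minimal sketch of two variations on the loop above. The function names, the `dst_path` parameter, and the constant `value` are assumptions made for illustration; they are not part of the original question. The first function appends each processed chunk to a single output file, and the second collects the chunks into one DataFrame, which only makes sense if the processed result fits in memory.

```python
import pandas as pd


def add_column_in_chunks(src_path, dst_path, num_rows, value):
    """Read src_path in chunks of num_rows rows and write the result to dst_path.

    dst_path should differ from src_path: the reader pulls rows lazily,
    so overwriting the source while iterating would corrupt the input.
    """
    reader = pd.read_csv(src_path, chunksize=num_rows)
    for i, data_chunk in enumerate(reader):
        data_chunk["new_column"] = value  # stand-in for the real processing
        data_chunk.to_csv(
            dst_path,
            mode="w" if i == 0 else "a",  # create the file once, then append
            header=(i == 0),              # write the header row only once
            index=False,
        )


def chunks_to_dataframe(src_path, num_rows, value):
    """Process the file chunk by chunk and return one combined DataFrame."""
    processed = []
    for data_chunk in pd.read_csv(src_path, chunksize=num_rows):
        data_chunk["new_column"] = value  # stand-in for the real processing
        processed.append(data_chunk)
    return pd.concat(processed, ignore_index=True)
```

Once the output file is complete, it can replace the original, for example with os.replace.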

Alternatively, you can load your dataframe with dask from the start and perform the computations with it.
It automatically splits your dataframe into partitions and performs operations lazily, just as if it were a regular pandas dataframe.

Solution 1 (2019-11-21 15:02:26)
