Prevent pandas from rewriting the formatted header for each chunk when writing to CSV

Question

I have a dirty CSV with an ugly header. I have formatted the header and stored it in a list.

I want to read this CSV chunk by chunk, apply a regex replacement to the data, and then write the result to a new CSV.

I'm using this function to do so:

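import pandas as pd

def format_data(data_location, formatted_header):
    # With chunksize set, read_csv returns an iterator of DataFrames
    df = pd.read_csv(data_location, sep=',', skiprows=1,
                     header=0, names=formatted_header, chunksize=10000)

    for chunk in df:
        # Strip commas that fall inside double-quoted text
        chunk = chunk.replace('(?!(([^"]*"){2})*[^"]*$),', '', regex=True)
        # Append each chunk to the output file (this writes the header every time)
        chunk.to_csv('formatted_data.csv', mode='a', index=False)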

Here is my understanding of what this call does:

pd.read_csv(data_location, sep=',', skiprows=1,
            header=0, names=formatted_header, chunksize=10000)

I am reading the CSV from its location, skipping the first ugly header row, and replacing it with my formatted_header.

My issue is that the formatted header row is written again with every chunk, so it reappears in the new CSV after every 10,000 rows. How can I prevent this from happening?
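A side note on that call: as far as I know, `skiprows=1` drops the first line before `header=0` is applied, so `header=0` then consumes the following line as the header as well. If the file has only a single header line, combining the two would silently discard the first data row. The sketch below uses made-up in-memory data (the column names and values are hypothetical) to compare the two variants; it is worth checking against the real file.

```python
import io
import pandas as pd

# Hypothetical sample: one ugly header line followed by two data rows
raw = "Ugly Col A,Ugly Col B\n1,2\n3,4\n"
names = ["col_a", "col_b"]

# skiprows=1 removes the ugly header, then header=0 treats "1,2" as a header too
both = pd.read_csv(io.StringIO(raw), skiprows=1, header=0, names=names)

# header=0 together with names= already replaces the existing header row
header_only = pd.read_csv(io.StringIO(raw), header=0, names=names)

print(len(both))         # number of data rows kept with both options
print(len(header_only))  # number of data rows kept with header=0 only
```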

Answer 1

Since you only want to write the header once, use a boolean flag to track whether you're on the first chunk.

For example:

write_header = True
for chunk in df:
    chunk = chunk.replace('(?!(([^"]*"){2})*[^"]*$),', '', regex=True)
    chunk.to_csv('formatted_data.csv', mode='a', index=False, header=write_header)
    write_header = False
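For a self-contained variant of the same idea, here is a minimal sketch that derives the flag from `enumerate` instead of a separate variable. The function name `write_in_chunks` and its parameters are made up for illustration, and the regex step from the question is left out to keep the focus on the header. It also opens the output in `'w'` mode for the first chunk, on the assumption that rows left over from an earlier run should not stay in the file.

```python
import pandas as pd


def write_in_chunks(source, formatted_header, out_path, chunksize=10000):
    """Copy a CSV to out_path chunk by chunk, writing the header only once."""
    # header=0 with names= replaces the file's own header row
    reader = pd.read_csv(source, header=0, names=formatted_header,
                         chunksize=chunksize)
    for i, chunk in enumerate(reader):
        first = (i == 0)
        chunk.to_csv(
            out_path,
            mode="w" if first else "a",  # overwrite on first chunk, then append
            index=False,
            header=first,                # header row only for the first chunk
        )
```

Calling `write_in_chunks('dirty.csv', formatted_header, 'formatted_data.csv')` (file names hypothetical) should then produce a single header line followed by all the data rows.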

Answer 1: 4, accepted, 2018-03-07 16:36:09
