How to split a csv based on multiple columns
I am trying to split a csv into multiple files based on the values of two columns. For example,
Source file:
Header1 Header2 Header3
Alpha energy 0.1
Alpha energy 0.34
Beta energy_imbalance 0.66
Beta energy 0.7
Beta energy 0.1
Gamma energy_imbalance 0.3
Expected output:
Output file 1:
Header1 Header2 Header3
Alpha energy 0.1
Alpha energy 0.34
Output file 2:
Header1 Header2 Header3
Beta energy_imbalance 0.66
Output file 3:
Header1 Header2 Header3
Beta energy 0.7
Beta energy 0.1
Output file 4:
Header1 Header2 Header3
Gamma energy_imbalance 0.3
Here is what I started with:
import csv

filein = open('test.csv', newline='')
csvin = csv.DictReader(filein)
outputs = {}
for row in csvin:
    primaryValue = row['Header1']
    secondaryValue = row['Header2']
    if primaryValue not in outputs:
        fileout = open('{}_{}.csv'.format(primaryValue, secondaryValue), 'w', newline='')
        dw = csv.DictWriter(fileout, fieldnames=csvin.fieldnames)
        dw.writeheader()
        outputs[primaryValue] = fileout, dw
    outputs[primaryValue][1].writerow(row)
for fileout, _ in outputs.values():
    fileout.close()
filein.close()
I am able to split the file based on column = Header1, but I am not sure how to take it further.
Try this:
import csv

filein = open('test.csv', newline='')
csvin = csv.DictReader(filein)
csv_files = {}  # maps (Header1, Header2) to its DictWriter
files = []
for row in csvin:
    key = (row['Header1'], row['Header2'])
    if key not in csv_files:
        # create the csv file
        fileout = open('{}_{}.csv'.format(*key), 'w', newline='')
        dw = csv.DictWriter(fileout, fieldnames=csvin.fieldnames)
        dw.writeheader()
        csv_files[key] = dw
        files.append(fileout)  # to close them later
    # write the line to the corresponding csv writer
    csv_files[key].writerow(row)
# close all files
filein.close()
for f in files:
    f.close()
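If the number of distinct (Header1, Header2) pairs is large, keeping one open file per pair may run into the operating system's limit on open files. A minimal sketch of one alternative, assuming the whole input fits in memory: collect the rows per key first, then write each output file in one go. The function name split_csv_in_memory and the out_dir parameter are made up for this illustration.

```python
import csv
import os
from collections import defaultdict


def split_csv_in_memory(infile_name, key_columns, out_dir='.'):
    """Group rows by the values of key_columns, then write one csv per group.

    Hypothetical helper: assumes the whole input fits in memory.
    Returns the output file paths in order of first appearance of each key.
    """
    groups = defaultdict(list)
    with open(infile_name, newline='') as infile:
        reader = csv.DictReader(infile)
        fieldnames = reader.fieldnames
        for row in reader:
            key = tuple(row[col] for col in key_columns)
            groups[key].append(row)

    written = []
    for key, rows in groups.items():
        outfile_name = os.path.join(out_dir, '_'.join(key) + '.csv')
        # Only one output file is open at any time.
        with open(outfile_name, 'w', newline='') as fileout:
            writer = csv.DictWriter(fileout, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        written.append(outfile_name)
    return written
```

For the example in the question this could be called as split_csv_in_memory('test.csv', ['Header1', 'Header2']). The trade-off is memory: every row is held until the input has been read completely.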
Here is a way to do it along the lines @Barmar suggested (except that it does not use f-strings to build the csv_files dictionary keys):
import csv

infile_name = 'test.csv'
with open(infile_name, newline='') as infile:
    reader = csv.DictReader(infile)
    csv_files = {}
    files = []
    for row in reader:
        key = '{}_{}'.format(row['Header1'], row['Header2'])
        if key not in csv_files:
            # Create the csv file
            outfile_name = '{}.csv'.format(key)
            fileout = open(outfile_name, 'w', newline='')
            writer = csv.DictWriter(fileout, fieldnames=reader.fieldnames)
            writer.writeheader()
            csv_files[key] = writer
            files.append(fileout)  # To close them later.
        # Write the line to corresponding csv writer.
        csv_files[key].writerow(row)
# Close all csv output files.
for f in files:
    f.close()
Applied to the sample input file, this produces the following csv output files:
Alpha_energy.csv
Beta_energy.csv
Beta_energy_imbalance.csv
Gamma_energy_imbalance.csv
containing the data you expect.
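One caveat with this kind of loop: if an exception is raised partway through, the explicit close loop at the end is never reached. A possible variation, sketched here with contextlib.ExitStack from the standard library, registers each output file as it is opened so that all of them are closed on exit, even after an error. The function name split_csv_by_columns and its parameters are assumptions for illustration, not part of the original answer.

```python
import csv
import os
from contextlib import ExitStack


def split_csv_by_columns(infile_name, key_columns, out_dir='.'):
    """Write one csv per distinct combination of key_columns values.

    Hypothetical helper. Returns the sorted list of keys that were written;
    each key corresponds to a file named '<key>.csv' in out_dir.
    """
    writers = {}
    with ExitStack() as stack:
        infile = stack.enter_context(open(infile_name, newline=''))
        reader = csv.DictReader(infile)
        for row in reader:
            key = '_'.join(row[col] for col in key_columns)
            if key not in writers:
                path = os.path.join(out_dir, key + '.csv')
                # The stack closes this file when the with block exits.
                fileout = stack.enter_context(open(path, 'w', newline=''))
                writer = csv.DictWriter(fileout, fieldnames=reader.fieldnames)
                writer.writeheader()
                writers[key] = writer
            writers[key].writerow(row)
    return sorted(writers)
```

Calling split_csv_by_columns('test.csv', ['Header1', 'Header2']) should produce the same four files as listed above. All output files still stay open until the input is exhausted, so this does not help with a very large number of distinct keys.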
Using pandas df.groupby() is another option for splitting a csv based on the values of multiple columns.
Working example:
import pandas as pd

df = pd.read_csv('test.csv')

def df_to_grouped_csv(df):
    df_group = df.groupby(['Header1', 'Header2'])
    for name, group in df_group:
        # name is the (Header1, Header2) tuple for this group
        outfile = '_'.join(name) + '.csv'
        group.to_csv(outfile, index=False)

df_to_grouped_csv(df)
Output:
Alpha_energy.csv
Header1 Header2 Header3
0 Alpha energy 0.10
1 Alpha energy 0.34
Beta_energy.csv
Header1 Header2 Header3
3 Beta energy 0.7
4 Beta energy 0.1
Beta_energy_imbalance.csv
Header1 Header2 Header3
2 Beta energy_imbalance 0.66
Gamma_energy_imbalance.csv
Header1 Header2 Header3
5 Gamma energy_imbalance 0.3
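In terms of performance, this should show an improvement over the csv.DictWriter approach (especially for large files), but it does require importing pandas.
Performance: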
Larger file (500,000 rows)
In [1]: %timeit df_to_grouped_csv()
865 ms ± 36.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [2]: %timeit csv_DictWriter_approach()
2.71 s ± 40.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
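These timings depend on the machine and on the data, neither of which is shown in the post. A rough sketch of how one might reproduce the csv.DictWriter side of the comparison with the standard library's timeit follows. The synthetic data generator, the row count and all function names here are assumptions and not the original benchmark.

```python
import csv
import os
import random
import tempfile
import timeit


def make_test_csv(path, n_rows, seed=0):
    """Write a synthetic csv with the three columns used in the question."""
    rng = random.Random(seed)
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Header1', 'Header2', 'Header3'])
        for _ in range(n_rows):
            writer.writerow([
                rng.choice(['Alpha', 'Beta', 'Gamma']),
                rng.choice(['energy', 'energy_imbalance']),
                round(rng.random(), 2),
            ])


def csv_dictwriter_approach(infile_name, out_dir):
    """Split by (Header1, Header2); returns the number of files written."""
    writers = {}
    files = []
    try:
        with open(infile_name, newline='') as infile:
            reader = csv.DictReader(infile)
            for row in reader:
                key = '{}_{}'.format(row['Header1'], row['Header2'])
                if key not in writers:
                    path = os.path.join(out_dir, key + '.csv')
                    fileout = open(path, 'w', newline='')
                    files.append(fileout)
                    writer = csv.DictWriter(fileout, fieldnames=reader.fieldnames)
                    writer.writeheader()
                    writers[key] = writer
                writers[key].writerow(row)
    finally:
        for f in files:
            f.close()
    return len(files)


def time_split(n_rows=10_000, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, 'test.csv')
        make_test_csv(src, n_rows)
        out_dir = os.path.join(tmp, 'out')
        os.makedirs(out_dir)
        timings = timeit.repeat(
            lambda: csv_dictwriter_approach(src, out_dir),
            number=1, repeat=repeats)
    return min(timings)


if __name__ == '__main__':
    print('best of 3: {:.3f} s'.format(time_split()))
```

To compare against the pandas version under the same conditions, the same generated file would need to be passed through a groupby-based function and timed the same way.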