Creating a Master excel file from dynamic CSV output using Python
consolidating csv file into one master file
I am facing the following challenge:
I have about 400 files that I have to consolidate into one master file, but there is one catch: the files have different headers, and when I try to consolidate them, the data ends up in different rows depending on which columns each file has.
Example: suppose I have two files, C1 and C2. File C1.csv:
name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2
and file C2.csv:
name,last-name,phone-no,add-line1,add-line2,add-line3
jorge,aggarwal,65465464654,line1,line2,line3
brad,smit,456446546454,line1,line2,line3
joy,kennedy,65654644646,line1,line2,line3
When I consolidate these two files, I want the output to look like this:
name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2
Jorge aggarwal,65465464654,line1-line2-line3
brad smith,456446546454,line1-line2-line3
joy kennedy,65654644646,line1-line2-line3
For the consolidation I use the following code:
```python
import glob
import pandas as pd

directory = 'C:/Test'  # specify the directory containing the 300 files
filelist = sorted(glob.glob(directory + '/*.csv'))  # reads all files in the directory and stores them as a list
consolidated = pd.DataFrame()  # create a new empty dataframe for consolidation

for file in filelist:  # iterate through each of the files
    df1 = pd.read_csv(file)  # create df using the file
    df1col = list(df1.columns)  # save columns to a list
    df2 = consolidated  # set the consolidated result as df2
    df2col = list(df2.columns)  # save columns from consolidated result as a list
    commoncol = [i for i in df1col for j in df2col if i == j]  # check both lists for common column names
    # print(commoncol)
    if commoncol == []:  # in the first iteration, consolidated is empty, which returns a blank df
        consolidated = pd.concat([df1, df2], axis=1).fillna(value=0)  # concatenate (no common columns), replacing null values with 0
    else:
        consolidated = df1.merge(df2, how='outer', on=commoncol).fillna(value=0)  # outer-merge on the common columns, replacing null values with 0
    # print(consolidated)  # optionally, check the consolidated df at each iteration

# write the consolidated df to another CSV
consolidated.to_csv('C:/<filepath>/consolidated.csv', header=True, index=False)
```
But it cannot consolidate columns that hold the same data, as shown in the desired output above.
From your two example files, you know the final (lowest-common) header of the output, and you know what one of the larger headers looks like.

My take is that every "other" kind of header needs to be mapped onto the final header, e.g. concatenating add-lines 1-3 into a single address field. We can read and write row by row with the csv module, dispatching each row to the appropriate consolidator (mapping) based on the input file's header.

The csv module provides a DictReader and DictWriter, which make it very convenient to deal with fields you know by name; in particular, the DictWriter() constructor takes an extrasaction="ignore" option, which means that if you tell the writer your fields are:
Col1, Col2, Col3
and you pass it a dict like this:
{"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"}
it will just ignore Col4 and write only Cols 1-3:
```python
import csv
import sys

writer = csv.DictWriter(sys.stdout, fieldnames=["Col1", "Col2", "Col3"], extrasaction="ignore")
writer.writeheader()
writer.writerow({"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"})
# Col1,Col2,Col3
# val1,val2,val3
```
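For completeness, DictReader is the reading counterpart: it uses the file's header row as the keys of each row dict, which is what lets us inspect a row's keys to decide which consolidator applies. A small illustration using an in-memory string in place of a file:

```python
import csv
import io

# DictReader maps each data row to a dict keyed by the header row
data = io.StringIO("name,phone-no,address\nzach,6564654654,line1\n")
reader = csv.DictReader(data)
rows = list(reader)
# rows[0] == {"name": "zach", "phone-no": "6564654654", "address": "line1"}
```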
```python
import csv

def consolidate_add_lines_1_to_3(row):
    # join the three address lines into the single address field
    row["address"] = "-".join([row["add-line1"], row["add-line2"], row["add-line3"]])
    # combine first and last name so the output matches the desired master format
    row["name"] = row["name"] + " " + row["last-name"]
    return row

# Add other consolidators here...
# ...

Final_header = ["name", "phone-no", "address"]

f_out = open("output.csv", "w", newline="")
writer = csv.DictWriter(f_out, fieldnames=Final_header, extrasaction="ignore")
writer.writeheader()

for fname in ["file1.csv", "file2.csv"]:
    f_in = open(fname, newline="")
    reader = csv.DictReader(f_in)
    for row in reader:
        if "add-line1" in row and "add-line2" in row and "add-line3" in row:
            row = consolidate_add_lines_1_to_3(row)
        # Add conditions for other consolidators here...
        # ...
        writer.writerow(row)
    f_in.close()
f_out.close()
```
If there is more than one kind of "other" header, you will need to discover those, figure out the extra consolidators to write, and add the conditions in the `for row in reader` loop that trigger them.
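Since eyeballing 400 files is not practical, one way to discover the distinct headers is to read just the first row of each file. This is a sketch, not part of the solution above; the function name `distinct_headers` and the directory layout are illustrative assumptions:

```python
import csv
import glob

def distinct_headers(directory):
    """Collect every distinct header (as a tuple of column names) across the CSVs."""
    headers = set()
    for fname in sorted(glob.glob(directory + "/*.csv")):
        with open(fname, newline="") as f:
            reader = csv.reader(f)
            header = next(reader, None)  # first row is the header; None if the file is empty
            if header is not None:
                headers.add(tuple(header))
    return headers

# for h in distinct_headers("C:/Test"):
#     print(h)
```

Each tuple printed is one header variant you need a consolidator (and a trigger condition) for.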