Creating a Master excel file from dynamic CSV output using Python
consolidating csv file into one master file
I am facing the following challenge:
I have about 400 files that I have to consolidate into one master file, but there is one catch: the files have different headers, and when I try to consolidate them, the data ends up in different rows depending on which columns each file has.
Example: suppose I have two files, C1 and C2. File C1.csv:
name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2
and file C2.csv:
name,last-name,phone-no,add-line1,add-line2,add-line3
jorge,aggarwal,65465464654,line1,line2,line3
brad,smit,456446546454,line1,line2,line3
joy,kennedy,65654644646,line1,line2,line3
When I consolidate these two files, I want the output to look like this:
name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2
Jorge aggarwal,65465464654,line1-line2-line3
brad smith,456446546454,line1-line2-line3
joy kennedy,65654644646,line1-line2-line3
For the consolidation I use the following code:
```python
import glob
import pandas as pd

directory = 'C:/Test'  # specify the directory containing the 300 files
filelist = sorted(glob.glob(directory + '/*.csv'))  # reads all files in the directory and stores them as a list
consolidated = pd.DataFrame()  # create a new empty dataframe for consolidation

for file in filelist:  # iterate through each of the files
    df1 = pd.read_csv(file)  # create df using the file
    df1col = list(df1.columns)  # save columns to a list
    df2 = consolidated  # set the consolidated result as df2
    df2col = list(df2.columns)  # save columns from consolidated result as a list
    commoncol = [i for i in df1col for j in df2col if i == j]  # check both lists for common column names
    # print(commoncol)
    if commoncol == []:  # in the first iteration, consolidated is empty, which returns a blank df
        consolidated = pd.concat([df1, df2], axis=1).fillna(value=0)  # concatenate (no common columns), replacing null values with 0
    else:
        consolidated = df1.merge(df2, how='outer', on=commoncol).fillna(value=0)  # outer-merge on the common columns, replacing null values with 0
    # print(consolidated)  # optionally, check the consolidated df at each iteration

# write the consolidated df to another CSV
consolidated.to_csv('C:/<filepath>/consolidated.csv', header=True, index=False)
```
But it cannot consolidate columns that hold the same data, as shown in the desired output above.
From your two example files, you know the final (lowest-common) header of the output, and you know what one of the larger headers looks like.

My take is that every "other" kind of header needs to be mapped onto the final header, e.g. concatenating add-lines 1-3 into a single address field. We can read and write row by row with the csv module, dispatching each row to the appropriate consolidator (mapping) based on the input file's header.

The csv module provides a DictReader and DictWriter, which make it very convenient to deal with fields you know by name; in particular, the DictWriter() constructor takes an extrasaction="ignore" option, which means that if you tell the writer your fields are:
Col1, Col2, Col3
and you pass it a dict like this:
{"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"}
it will just ignore Col4 and write only Cols 1-3:
```python
import csv
import sys

writer = csv.DictWriter(sys.stdout, fieldnames=["Col1", "Col2", "Col3"], extrasaction="ignore")
writer.writeheader()
writer.writerow({"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"})
# Col1,Col2,Col3
# val1,val2,val3
```
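For completeness, DictReader is the reading counterpart: it uses the file's header row as the keys of each row dict, which is what lets us inspect a row's keys to decide which consolidator applies. A small illustration using an in-memory string in place of a file:

```python
import csv
import io

# DictReader maps each data row to a dict keyed by the header row
data = io.StringIO("name,phone-no,address\nzach,6564654654,line1\n")
reader = csv.DictReader(data)
rows = list(reader)
# rows[0] == {"name": "zach", "phone-no": "6564654654", "address": "line1"}
```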
```python
import csv

def consolidate_add_lines_1_to_3(row):
    # join the three address lines into the single address field
    row["address"] = "-".join([row["add-line1"], row["add-line2"], row["add-line3"]])
    # combine first and last name so the output matches the desired master format
    row["name"] = row["name"] + " " + row["last-name"]
    return row

# Add other consolidators here...
# ...

Final_header = ["name", "phone-no", "address"]

f_out = open("output.csv", "w", newline="")
writer = csv.DictWriter(f_out, fieldnames=Final_header, extrasaction="ignore")
writer.writeheader()

for fname in ["file1.csv", "file2.csv"]:
    f_in = open(fname, newline="")
    reader = csv.DictReader(f_in)
    for row in reader:
        if "add-line1" in row and "add-line2" in row and "add-line3" in row:
            row = consolidate_add_lines_1_to_3(row)
        # Add conditions for other consolidators here...
        # ...
        writer.writerow(row)
    f_in.close()
f_out.close()
```
If there is more than one kind of "other" header, you will need to discover those, figure out the extra consolidators to write, and add the conditions in the `for row in reader` loop that trigger them.
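Since eyeballing 400 files is not practical, one way to discover the distinct headers is to read just the first row of each file. This is a sketch, not part of the solution above; the function name `distinct_headers` and the directory layout are illustrative assumptions:

```python
import csv
import glob

def distinct_headers(directory):
    """Collect every distinct header (as a tuple of column names) across the CSVs."""
    headers = set()
    for fname in sorted(glob.glob(directory + "/*.csv")):
        with open(fname, newline="") as f:
            reader = csv.reader(f)
            header = next(reader, None)  # first row is the header; None if the file is empty
            if header is not None:
                headers.add(tuple(header))
    return headers

# for h in distinct_headers("C:/Test"):
#     print(h)
```

Each tuple printed is one header variant you need a consolidator (and a trigger condition) for.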