
Consolidating CSV files into one master file

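I am facing the following challenge.

I have about 400 files that I have to consolidate into one master file. The catch is that the files have different headers, so when I try to consolidate them, the data ends up in different rows depending on the column.

Example: assume I have two files, C1 and C2. File C1.csv: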

name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2

and file C2.csv:

name,last-name,phone-no,add-line1,add-line2,add-line3
jorge,aggarwal,65465464654,line1,line2,line3
brad,smith,456446546454,line1,line2,line3
joy,kennedy,65654644646,line1,line2,line3

So I have these two files, and when I consolidate them I would like the output to look like this:

name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2
jorge aggarwal,65465464654,line1-line2-line3
brad smith,456446546454,line1-line2-line3
joy kennedy,65654644646,line1-line2-line3

For the consolidation I am using the following code:

import glob
import pandas as pd

directory = 'C:/Test' # specify the directory containing the 300 files
filelist = sorted (glob.glob(directory + '/*.csv')) # reads all 300 files in the directory and stores as a list
consolidated = pd.DataFrame() # Create a new empty dataframe for consolidation
for file in filelist:            # Iterate through each of the 300 files
    df1 = pd.read_csv(file)      # create df using the file  
    df1col = list (df1.columns)  # save columns to a list
    df2 = consolidated           # set the consolidated as your df2
    df2col = list (df2.columns)  # save columns from consolidated result as list
    commoncol = [i for i in df1col for j in df2col if i==j] # Check both lists for common column name
    # print (commoncol)
    if commoncol == []:          # In first iteration, consolidated file is empty, which will return in a blank df
        consolidated = pd.concat([df1, df2], axis=1).fillna(value=0)  # concatenate (outer join) with no common columns replacing null values with 0
    else:
        consolidated = df1.merge(df2,how='outer', on=commoncol).fillna(value=0)        # merge both df specifying the common column and replace null values with 0
    # print (consolidated)   << Optionally, check the consolidated df at each iteration 

# writing consolidated df to another CSV
consolidated.to_csv('C:/<filepath>/consolidated.csv', header=True, index=False)

But it does not merge the columns that hold the same data, as in the output shown earlier.

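From your two-file example, you know the final (least common) header for the output, and you know what one of the bigger headers looks like.

My take is that every "other" kind of header needs a mapping to the final header, for example concatenating add-line1 through add-line3 into a single address field. We can use the csv module to read and write row by row, and send each row to the appropriate consolidator (mapping) based on the header of the input file.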
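Before writing any mapping, it may help to list which header variants actually occur across the input files. The following is a minimal sketch, not part of the original answer: `count_header_variants` is a hypothetical helper name, and it assumes every file is a comma-separated CSV whose first row is the header.

```python
import csv
import glob
import os
from collections import Counter


def count_header_variants(directory):
    """Return a Counter mapping each distinct header (as a tuple) to the number of files using it."""
    variants = Counter()
    for path in sorted(glob.glob(os.path.join(directory, "*.csv"))):
        with open(path, newline="") as f:
            # Read only the first row; an empty file yields an empty header.
            header = next(csv.reader(f), [])
        variants[tuple(header)] += 1
    return variants
```

Calling `count_header_variants("C:/Test").most_common()` on the directory from the question would list each header layout with its file count, which tells you how many consolidators are needed.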

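The csv module provides a DictReader and a DictWriter, which make it very convenient to work with fields you know by name. In particular, the DictWriter() constructor has the extrasaction="ignore" option, which means that if you tell the writer your fields are: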

Col1, Col2, Col3

and you pass it a dict like this:

{"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"} 

it will simply ignore Col4 and write only Col1 through Col3:

import csv
import sys

writer = csv.DictWriter(sys.stdout, fieldnames=["Col1", "Col2", "Col3"], extrasaction="ignore")
writer.writeheader()
writer.writerow({"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"})

# Col1,Col2,Col3
# val1,val2,val3
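
The opposite case may also come up: a row that lacks one of the declared fields. As a small sketch (the in-memory `io.StringIO` buffer is only used here for illustration), DictWriter fills a missing key with its `restval` value, which defaults to an empty string:

```python
import csv
import io

buf = io.StringIO()
writer = csv.DictWriter(
    buf,
    fieldnames=["Col1", "Col2", "Col3"],
    extrasaction="ignore",
    restval="",  # value written for any field missing from the row dict
)
writer.writeheader()
writer.writerow({"Col1": "val1", "Col3": "val3"})  # no "Col2" key

output = buf.getvalue()
print(output)
```

The second line of the output has an empty middle field, so a file that lacks a column still produces well-formed rows in the master file.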

The full script then reads each input file and passes its rows through the matching consolidator:

import csv


def consolidate_add_lines_1_to_3(row):
    """Join the three address lines into the single "address" field."""
    row["address"] = "-".join([row["add-line1"], row["add-line2"], row["add-line3"]])
    return row


# Add other consolidators here...
# ...


Final_header = ["name", "phone-no", "address"]

# The with-blocks close both files even if a row raises an error.
with open("output.csv", "w", newline="") as f_out:
    # extrasaction="ignore" drops any keys that are not in Final_header.
    writer = csv.DictWriter(f_out, fieldnames=Final_header, extrasaction="ignore")
    writer.writeheader()

    for fname in ["file1.csv", "file2.csv"]:
        with open(fname, newline="") as f_in:
            reader = csv.DictReader(f_in)

            for row in reader:
                if "add-line1" in row and "add-line2" in row and "add-line3" in row:
                    row = consolidate_add_lines_1_to_3(row)

                # Add conditions for other consolidators here...
                # ...

                writer.writerow(row)

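If there is more than one other kind of header, you need to identify each of them, write the extra consolidators, and work out the conditions in the for row in reader loop that trigger them.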
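
One way to keep those conditions manageable, sketched here under assumptions rather than taken from the original answer, is to key the consolidators on the exact header tuple of each file. The names `CONSOLIDATORS`, `keep_as_is`, `merge_name_and_address_lines`, and `consolidate` are hypothetical. The second consolidator also joins name and last-name, since the expected output in the question shows the two combined.

```python
import csv

FINAL_HEADER = ["name", "phone-no", "address"]


def keep_as_is(row):
    """Rows that already use the final header need no changes."""
    return row


def merge_name_and_address_lines(row):
    """Map the C2.csv layout from the question onto the final header."""
    row["name"] = f'{row["name"]} {row["last-name"]}'
    row["address"] = "-".join([row["add-line1"], row["add-line2"], row["add-line3"]])
    return row


# One entry per known header layout; extend this as new layouts turn up.
CONSOLIDATORS = {
    ("name", "phone-no", "address"): keep_as_is,
    ("name", "last-name", "phone-no", "add-line1", "add-line2", "add-line3"):
        merge_name_and_address_lines,
}


def consolidate(in_paths, out_path):
    """Write every row of every input file to out_path using FINAL_HEADER."""
    with open(out_path, "w", newline="") as f_out:
        writer = csv.DictWriter(f_out, fieldnames=FINAL_HEADER, extrasaction="ignore")
        writer.writeheader()
        for path in in_paths:
            with open(path, newline="") as f_in:
                reader = csv.DictReader(f_in)
                header = tuple(reader.fieldnames or ())
                try:
                    consolidator = CONSOLIDATORS[header]
                except KeyError:
                    # Fail loudly instead of silently writing half-empty rows.
                    raise ValueError(f"no consolidator for header {header} in {path}") from None
                for row in reader:
                    writer.writerow(consolidator(row))
```

With this layout, an unknown header stops the run with the offending file name, so each new variant among the 400 files is noticed and gets its own mapping instead of being written out with missing fields.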

