
Consolidating CSV files into one master file

I am facing the following challenge.

I have approximately 400 files that I have to consolidate into one master file. The problem is that the files have different headers, and when I try to consolidate them, the data is placed in different rows according to the column names instead of being combined.

Example: say I have two files, C1.csv and C2.csv. File C1.csv:

name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2

and file C2.csv:

name,last-name,phone-no,add-line1,add-line2,add-line3
jorge,aggarwal,65465464654,line1,line2,line3
brad,smit,456446546454,line1,line2,line3
joy,kennedy,65654644646,line1,line2,line3

Given these two files, I want the consolidated output to look like this:

name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2
jorge aggarwal,65465464654,line1-line2-line3
brad smit,456446546454,line1-line2-line3
joy kennedy,65654644646,line1-line2-line3

For the consolidation I am using the following code:

import glob
import pandas as pd

directory = 'C:/Test' # specify the directory containing the CSV files
filelist = sorted(glob.glob(directory + '/*.csv')) # collect every CSV file in the directory as a sorted list
consolidated = pd.DataFrame() # Create a new empty dataframe for consolidation
for file in filelist:            # Iterate through each file
    df1 = pd.read_csv(file)      # create df using the file  
    df1col = list (df1.columns)  # save columns to a list
    df2 = consolidated           # set the consolidated as your df2
    df2col = list (df2.columns)  # save columns from consolidated result as list
    commoncol = [i for i in df1col for j in df2col if i==j] # Check both lists for common column name
    # print (commoncol)
    if commoncol == []:          # In the first iteration, consolidated is empty, so there are no common columns
        consolidated = pd.concat([df1, df2], axis=1).fillna(value=0)  # concatenate (outer join) with no common columns replacing null values with 0
    else:
        consolidated = df1.merge(df2,how='outer', on=commoncol).fillna(value=0)        # merge both df specifying the common column and replace null values with 0
    # print (consolidated)   << Optionally, check the consolidated df at each iteration 

# writing consolidated df to another CSV
consolidated.to_csv('C:/<filepath>/consolidated.csv', header=True, index=False)

but it does not merge the columns that hold the same data, so I don't get the output shown earlier.

From your two-file example, you know the final (least-common-denominator) header for the output, and you know what one of the bigger headers looks like.

My take is to think of every "other" kind of header as needing a mapping to the final header, such as concatenating add-line1 through add-line3 into a single address field. We can use the csv module to read and write row by row, sending each row to the appropriate consolidator (mapping) based on the header of the input file.
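One possible way to express "a mapping per header" is a lookup table keyed by the exact header. This is only a sketch under assumptions: the names `CONSOLIDATORS`, `passthrough`, `merge_address_lines`, and `pick_consolidator` are hypothetical, and the table lists just the two headers from the question.

```python
import csv
import io

FINAL_HEADER = ["name", "phone-no", "address"]


def passthrough(row):
    # Rows that already use the final header need no changes.
    return row


def merge_address_lines(row):
    # Hypothetical mapping for the C2-style header: join the three address parts.
    row["address"] = "-".join([row["add-line1"], row["add-line2"], row["add-line3"]])
    return row


# One entry per known header; the key is the exact tuple of column names.
CONSOLIDATORS = {
    ("name", "phone-no", "address"): passthrough,
    ("name", "last-name", "phone-no", "add-line1", "add-line2", "add-line3"): merge_address_lines,
}


def pick_consolidator(fieldnames):
    """Return the mapping function for a header, or fail loudly if it is unknown."""
    try:
        return CONSOLIDATORS[tuple(fieldnames)]
    except KeyError:
        raise ValueError(f"no consolidator registered for header {fieldnames!r}") from None


# In-memory stand-in for C2.csv so the sketch runs without any files on disk.
sample = io.StringIO(
    "name,last-name,phone-no,add-line1,add-line2,add-line3\n"
    "jorge,aggarwal,65465464654,line1,line2,line3\n"
)
reader = csv.DictReader(sample)
consolidate = pick_consolidator(reader.fieldnames)
rows = [consolidate(row) for row in reader]
```

Failing on an unknown header is a deliberate choice here: with around 400 files, a silent fallback could write rows with empty address fields without anyone noticing.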

The csv module provides DictReader and DictWriter, which make it very handy to deal with fields you know by name. In particular, the DictWriter() constructor has the extrasaction="ignore" option, which means that if you tell the writer your fields are:

Col1, Col2, Col3

and you pass a dict like:

{"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"} 

it will just ignore Col4 and write only Col1 through Col3:

import csv
import sys

writer = csv.DictWriter(sys.stdout, fieldnames=["Col1", "Col2", "Col3"], extrasaction="ignore")
writer.writeheader()
writer.writerow({"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"})

# Col1,Col2,Col3
# val1,val2,val3
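For contrast, here is a small sketch of what happens without that option. By default extrasaction is "raise", so the same dict is rejected with a ValueError. Keys that are missing from the dict are a separate case: they are filled with the writer's restval, which defaults to an empty string. The variable names below are illustrative only.

```python
import csv
import io

row_with_extra = {"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"}

# Default behaviour: extrasaction="raise" rejects keys that are not in fieldnames.
strict_buffer = io.StringIO()
strict_writer = csv.DictWriter(strict_buffer, fieldnames=["Col1", "Col2", "Col3"])
try:
    strict_writer.writerow(row_with_extra)
    raised = False
except ValueError as exc:
    raised = True
    message = str(exc)  # the message names the offending key

# Missing keys are not an error: they are written as restval (default "").
lenient_buffer = io.StringIO()
lenient_writer = csv.DictWriter(
    lenient_buffer, fieldnames=["Col1", "Col2", "Col3"], extrasaction="ignore"
)
lenient_writer.writerow({"Col1": "val1", "Col2": "val2", "Col4": "val4"})
```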

Putting it together for the two example files:

import csv


def consolidate_add_lines_1_to_3(row):
    row["address"] = "-".join([row["add-line1"], row["add-line2"], row["add-line3"]])
    return row


# Add other consolidators here...
# ...


Final_header = ["name", "phone-no", "address"]

# The with blocks close both files even if a row raises an error
with open("output.csv", "w", newline="") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=Final_header, extrasaction="ignore")
    writer.writeheader()

    for fname in ["C1.csv", "C2.csv"]:
        with open(fname, newline="") as f_in:
            reader = csv.DictReader(f_in)

            for row in reader:
                # C2-style rows: merge the three address lines into "address"
                if "add-line1" in row and "add-line2" in row and "add-line3" in row:
                    row = consolidate_add_lines_1_to_3(row)

                # Add conditions for other consolidators here...
                # ...

                writer.writerow(row)
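The desired output in the question also folds last-name into name, which the address consolidator does not cover. A second consolidator for that could look like the sketch below. `consolidate_name` is a hypothetical helper, and the two files are replaced by in-memory strings (with lineterminator="\n") so the result is easy to inspect.

```python
import csv
import io

FINAL_HEADER = ["name", "phone-no", "address"]

# In-memory copies of the two example files from the question.
C1 = "name,phone-no,address\nzach,6564654654,line1\ndaniel,456464564,line2\n"
C2 = (
    "name,last-name,phone-no,add-line1,add-line2,add-line3\n"
    "jorge,aggarwal,65465464654,line1,line2,line3\n"
    "brad,smit,456446546454,line1,line2,line3\n"
    "joy,kennedy,65654644646,line1,line2,line3\n"
)


def consolidate_name(row):
    # Hypothetical helper: fold "last-name" into "name", separated by a space.
    row["name"] = f'{row["name"]} {row["last-name"]}'
    return row


def consolidate_add_lines_1_to_3(row):
    row["address"] = "-".join([row["add-line1"], row["add-line2"], row["add-line3"]])
    return row


out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=FINAL_HEADER, extrasaction="ignore", lineterminator="\n"
)
writer.writeheader()

for text in (C1, C2):
    for row in csv.DictReader(io.StringIO(text)):
        if "last-name" in row:
            row = consolidate_name(row)
        if all(key in row for key in ("add-line1", "add-line2", "add-line3")):
            row = consolidate_add_lines_1_to_3(row)
        writer.writerow(row)

print(out.getvalue())
```

Each condition checks only for the columns its consolidator needs, so a file that has last-name but a single address column would still get its name merged.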

If there is more than one kind of "other" header, you'll need to seek those out, figure out the extra consolidators to write, and work out the conditions that trigger them in the for row in reader loop.
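To seek out the distinct headers across a few hundred files, a short survey script may help. This is a sketch, not part of the approach above: `group_files_by_header` is a hypothetical name, and the directory `C:/Test` is taken from the question's code.

```python
import csv
import glob
import os
from collections import defaultdict


def group_files_by_header(directory):
    """Map each distinct header (as a tuple of column names) to the files that use it."""
    groups = defaultdict(list)
    for path in sorted(glob.glob(os.path.join(directory, "*.csv"))):
        with open(path, newline="") as f:
            header = next(csv.reader(f), [])  # an empty file yields an empty header
        groups[tuple(header)].append(os.path.basename(path))
    return dict(groups)


if __name__ == "__main__":
    # Print one line per distinct header with the number of files that share it.
    for header, files in group_files_by_header("C:/Test").items():
        print(len(files), "file(s):", ", ".join(header))
```

Each distinct header that is printed corresponds to one consolidator (or one condition) that still has to be written.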


 