
Consolidating CSV files into one master file

I am facing the following challenge:

I have approximately 400 files that I have to consolidate into one master file, but there is one problem: the files have different headers, and when I try to consolidate them, the data ends up in different rows based on the columns.

Example: let's say I have two files, C1 and C2. File C1.csv:

name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2

and file C2.csv:

name,last-name,phone-no,add-line1,add-line2,add-line3
jorge,aggarwal,65465464654,line1,line2,line3
brad,smit,456446546454,line1,line2,line3
joy,kennedy,65654644646,line1,line2,line3

When I consolidate these two files, I want the output to look like this:

name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2
jorge aggarwal,65465464654,line1-line2-line3
brad smit,456446546454,line1-line2-line3
joy kennedy,65654644646,line1-line2-line3

For consolidation I am using the following code:

import glob
import pandas as pd

directory = 'C:/Test'  # specify the directory containing the 400 files
filelist = sorted(glob.glob(directory + '/*.csv'))  # list all CSV files in the directory
consolidated = pd.DataFrame()  # create a new empty dataframe for consolidation
for file in filelist:            # iterate through each of the 400 files
    df1 = pd.read_csv(file)      # create df from the file
    df1col = list(df1.columns)   # save its columns to a list
    df2 = consolidated           # use the consolidated result so far as df2
    df2col = list(df2.columns)   # save the consolidated columns as a list
    commoncol = [i for i in df1col for j in df2col if i == j]  # find column names common to both
    if commoncol == []:          # on the first iteration, consolidated is empty, so there are no common columns
        consolidated = pd.concat([df1, df2], axis=1).fillna(value=0)  # concatenate (no common columns), replacing null values with 0
    else:
        consolidated = df1.merge(df2, how='outer', on=commoncol).fillna(value=0)  # outer-merge on the common columns, replacing null values with 0
    # print(consolidated)  # optionally, check the consolidated df at each iteration

# writing consolidated df to another CSV
consolidated.to_csv('C:/<filepath>/consolidated.csv', header=True, index=False)

but it can't merge columns that hold the same kind of data, as in the desired output shown earlier.

From your two-file example, you know the final (smallest common) header for the output, and you know what one of the larger headers looks like.

My take is to treat every "other" kind of header as needing a mapping to the final header, such as concatenating add-lines 1-3 into a single address field. We can use the csv module to read and write row by row and send each row to the appropriate consolidator (mapping) based on the header of the input file.

The csv module provides DictReader and DictWriter, which make dealing with fields you know by name very handy. In particular, the DictWriter() constructor has the extrasaction="ignore" option, which means that if you tell the writer your fields are:

Col1, Col2, Col3

and you pass a dict like:

{"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"} 

it will just ignore Col4 and only write Cols 1-3:

import csv
import sys

writer = csv.DictWriter(sys.stdout, fieldnames=["Col1", "Col2", "Col3"], extrasaction="ignore")
writer.writeheader()
writer.writerow({"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"})

# Col1,Col2,Col3
# val1,val2,val3
Putting that together for your example files:

import csv


def consolidate_add_lines_1_to_3(row):
    # This header also carries a separate last-name, so fold it into name
    # to match the desired output.
    row["name"] = row["name"] + " " + row["last-name"]
    row["address"] = "-".join([row["add-line1"], row["add-line2"], row["add-line3"]])
    return row


# Add other consolidators here...
# ...


final_header = ["name", "phone-no", "address"]

f_out = open("output.csv", "w", newline="")
writer = csv.DictWriter(f_out, fieldnames=final_header, extrasaction="ignore")
writer.writeheader()

for fname in ["C1.csv", "C2.csv"]:
    f_in = open(fname, newline="")
    reader = csv.DictReader(f_in)

    for row in reader:
        if "add-line1" in row and "add-line2" in row and "add-line3" in row:
            row = consolidate_add_lines_1_to_3(row)

        # Add conditions for other consolidators here...
        # ...

        writer.writerow(row)

    f_in.close()

f_out.close()

If there is more than one kind of other header, you'll need to seek those out, figure out the extra consolidators to write, and the conditions that trigger them in the "for row in reader" loop.
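To see which extra consolidators you need, a quick survey of the distinct headers actually present in the folder helps. A minimal sketch, where the directory path and glob pattern are assumptions to adjust to your setup:

```python
import csv
import glob
from collections import Counter

def survey_headers(directory):
    # Count how many files share each distinct header layout.
    counts = Counter()
    for fname in glob.glob(directory + "/*.csv"):
        with open(fname, newline="") as f:
            header = tuple(next(csv.reader(f), []))  # first row is the header
        counts[header] += 1
    return counts

# Each distinct header needs its own consolidator (or none, if it
# already matches the final header).
for header, n in survey_headers("C:/Test").items():
    print(n, "file(s) with header:", ",".join(header))
```

Running this once over all 400 files tells you exactly how many mappings you have to write before touching the consolidation loop.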
