
Loop through Multiple CSV Files and Merge with Specific Columns [Pandas]

I have a list of CSV files. Each file has 5 columns, with 'id' as the only common column (primary key). The other 4 columns are different in every file.

The column I am interested in is the 5th (last) one, which is different in each file. I want to merge the files on 'id'.

I have tried the following code, but it concatenates row-wise, giving me many duplicate 'id' values as well as 'NaN' values:

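import glob
import pandas as pd

filelist = glob.glob(path + "/*.csv")
li = []
for filename in filelist:
    # usecols is zero-based: 0 is 'id', 4 is the 5th (last) column
    df = pd.read_csv(filename, index_col=None, header=0, usecols=[0, 4])
    li.append(df)

# axis=0 stacks the frames row-wise
frame = pd.concat(li, axis=0, ignore_index=True)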

I want to concatenate them column-wise, keeping only my column of interest (the 5th column).

For example :

My list of files: ['df1.csv', 'df2.csv', 'df3.csv', 'df4.csv']

df1.csv has the following structure:

   ID  No1 AA
0   1   0   4
1   2   1   5
2   3   0   6

df2.csv has this structure:

   ID  No2 BB
0   2   0   5
1   3   1   6
2   4   0   7

The list goes on. My desired output would be:

    ID  AA  BB  CC  DD
0   1   4.0 NaN 0   1
1   2   5.0 5.0 1   0
2   3   6.0 6.0 1   0
3   4   NaN 7.0 1   1

Any suggestions would be appreciated. Thank you.

Starting from your example, the easiest approach seems to be to set 'ID' as the index and join on it implicitly. The last column is retrieved by position, using -1 as the index:

import pandas as pd

filelist = [
    '/tmp/csvs/df1.csv',
    '/tmp/csvs/df2.csv',
]

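result = pd.DataFrame()

for f in filelist:
    # sep=r'\s+' parses whitespace-separated files; omit it for comma-separated CSVs
    df = pd.read_csv(f, sep=r'\s+').set_index('ID')
    last_col = df.columns[-1]  # name of the last column
    # An outer join on the index keeps every ID seen so far
    result = result.join(df[last_col], how='outer')
result.reset_index(inplace=True)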

result

Out[1]: 
   ID   AA   BB
0   1  4.0  NaN
1   2  5.0  5.0
2   3  6.0  6.0
3   4  NaN  7.0
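
The same index-based idea can also be written with a single pd.concat call along axis=1, which is closer to the code in the question. This is only a sketch under a few assumptions: the files are comma-separated, 'ID' is unique within each file, and the two sample files written to a temporary directory below are hypothetical stand-ins for the real ones.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for the real files (comma-separated).
samples = {
    "df1.csv": "ID,No1,AA\n1,0,4\n2,1,5\n3,0,6\n",
    "df2.csv": "ID,No2,BB\n2,0,5\n3,1,6\n4,0,7\n",
}

with tempfile.TemporaryDirectory() as path:
    for name, text in samples.items():
        with open(os.path.join(path, name), "w") as fh:
            fh.write(text)

    columns = []
    for filename in sorted(glob.glob(os.path.join(path, "*.csv"))):
        df = pd.read_csv(filename, index_col="ID")
        columns.append(df.iloc[:, -1])  # keep only the last column, indexed by ID

# axis=1 places the Series side by side, aligned on their ID index;
# join="outer" keeps every ID that appears in any file.
result = (
    pd.concat(columns, axis=1, join="outer")
    .sort_index()
    .rename_axis("ID")  # make sure the index is labelled ID before resetting it
    .reset_index()
)
print(result)
```

Because the alignment happens on the index, an ID missing from a file shows up as NaN in that file's column instead of producing a duplicate row.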

Merge on ID using only the first and last columns:

df = df1.iloc[:,[0,-1]].merge(df2.iloc[:,[0,-1]],on="ID",how="outer")

For each additional file, merge its first and last columns into the running result in the same way:

df = df.merge(df3.iloc[:,[0,-1]],on="ID",how="outer")

In use:

import pandas as pd

data1 = {"ID":[1,2,3], "No1":[0,1,0], "AA":[4,5,6]}
data2 = {"ID":[2,3,4], "No2":[0,1,0], "BB":[5,6,7]}
data3 = {"ID":[1,3,4], "No3":[0,1,0], "CC":[2,3,4]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)

df = df1.iloc[:,[0,-1]].merge(df2.iloc[:,[0,-1]],on="ID",how="outer")
print(df.merge(df3.iloc[:,[0,-1]],on="ID",how="outer"))

Output:

   ID   AA   BB   CC
0   1  4.0  NaN  2.0
1   2  5.0  5.0  NaN
2   3  6.0  6.0  3.0
3   4  NaN  7.0  4.0
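
With more than a handful of files, writing the merges out one by one gets tedious. One possible way to generalize the pairwise merge is to fold the list with functools.reduce. The sketch below reuses the three sample frames from this answer; the names frames, trimmed and merged are illustrative only. It assumes the last column has a different name in every frame, since otherwise merge would add _x/_y suffixes.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "No1": [0, 1, 0], "AA": [4, 5, 6]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "No2": [0, 1, 0], "BB": [5, 6, 7]})
df3 = pd.DataFrame({"ID": [1, 3, 4], "No3": [0, 1, 0], "CC": [2, 3, 4]})

frames = [df1, df2, df3]

# Keep only the first (ID) and last column of each frame.
trimmed = [frame.iloc[:, [0, -1]] for frame in frames]

# Fold the list into one frame with successive outer merges on ID.
merged = reduce(
    lambda left, right: left.merge(right, on="ID", how="outer"),
    trimmed,
)
print(merged)
```

The list could equally be built by reading each file in a loop and appending the resulting frame.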
