
How to merge multiple raw input CSVs with pandas when similar columns have slightly different names?

I wrote some code to combine multiple CSVs, which are read with pandas and appended to one combined CSV.

The issue I have is that the CSV files are delivered by multiple parties (monthly) and often contain differences with regard to column names, while they essentially contain the same information. For instance:

CSV 1

| ID | Instance number |
| -------- | -------------- |
| 1 | 401421 |
| 2 | 420138 |

CSV 2

| ID | Instance NO |
| -------- | -------------- |
| 1 | 482012 |
| 2 | 465921 |

This results in two columns in the combined file, Instance number and Instance NO, unless I rename the column beforehand, whereas the idea is to process all files automatically, without manual intervention.

One solution that should work is to use combine_first or fillna, but next time the column may be entered as, e.g., Instance No/number.
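To illustrate what I mean, here is a minimal sketch of the combine_first idea. The DataFrame below is a made-up stand-in for the combined file, using the values from the two example CSVs; it only works as long as the variant column names are known in advance.

```python
import pandas as pd

# Hypothetical combined frame: the same information ended up in two columns.
combined = pd.DataFrame(
    {
        "ID": [1, 2, 1, 2],
        "Instance number": [401421, 420138, None, None],
        "Instance NO": [None, None, 482012, 465921],
    }
)

# Keep values from "Instance number" and fall back to "Instance NO" where missing.
combined["Instance number"] = combined["Instance number"].combine_first(
    combined["Instance NO"]
)

# The variant column is no longer needed once its values have been merged in.
combined = combined.drop(columns=["Instance NO"])
```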

Since improving data delivery isn't an option, is there any smart way to solve issues like this without having to write out all possible variations and remap them to one leading column?

Thanks in advance!

I think you first need a dictionary of all possible names (you can quickly add new ones whenever you encounter them), and then use it to rename the columns. For example:

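# Canonical column name -> list of known variants
general_dict = {'SLNO': ['Sl No', 'SNo']}

# all_df is assumed to be the DataFrame read from one of the CSV files
col_list = all_df.columns.to_list()
rename_dict = {}

for col in col_list:
    for key, val in general_dict.items():
        if col in val:
            # Map the variant name to its canonical name
            rename_dict[col] = key
            break

all_df.rename(columns=rename_dict, inplace=True)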

The short answer is no, as you're asking the computer to think for itself. You do, however, have multiple options for dealing with common scenarios.

If the column order and/or positions are fixed, you can pass header=0, names=['ID', 'Instance'] to pd.read_csv to ignore the headers sent in the file and use known names instead.
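A minimal sketch of that option, assuming every file has the same two columns in the same order. The in-memory buffers stand in for the delivered files and reuse the example data from the question.

```python
import io

import pandas as pd

# Stand-ins for two delivered files whose headers differ.
csv_1 = io.StringIO("ID,Instance number\n1,401421\n2,420138\n")
csv_2 = io.StringIO("ID,Instance NO\n1,482012\n2,465921\n")

# Column names we decide on ourselves, in the fixed column order.
known_names = ["ID", "Instance"]

# header=0 marks the first row as a header to discard; names= replaces it.
frames = [pd.read_csv(f, header=0, names=known_names) for f in (csv_1, csv_2)]

combined = pd.concat(frames, ignore_index=True)
```

Note that this relies purely on position: if a party reorders or adds columns, the data will be mislabeled without warning.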

You can also create a config file that maps all possible wrong header names to the right one.
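As a rough sketch of the config-file idea: the JSON layout, the CONFIG_TEXT contents, and the build_rename_map helper are all hypothetical and chosen only for illustration. In practice the JSON would live in its own file that you extend whenever a new variant shows up.

```python
import json

import pandas as pd

# Hypothetical config contents: canonical name -> list of known wrong variants.
CONFIG_TEXT = """
{
  "Instance number": ["Instance NO", "Instance No/number"]
}
"""


def build_rename_map(config):
    """Invert {canonical: [variants]} into {variant: canonical}."""
    return {
        variant: canonical
        for canonical, variants in config.items()
        for variant in variants
    }


rename_map = build_rename_map(json.loads(CONFIG_TEXT))

# Stand-in for a delivered file that uses one of the variant headers.
df = pd.DataFrame({"ID": [1, 2], "Instance NO": [482012, 465921]})

# Columns that are not listed in the map (such as "ID") are left untouched.
df = df.rename(columns=rename_map)
```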

If the columns are in the same order in all the files, you could try something like this:

  1. Pre-define the column names up front.
  2. Assign those names to each file's DataFrame right after reading it, then concat the DataFrames.

import pandas as pd

data1 = pd.read_csv('data/data1.csv')
data2 = pd.read_csv('data/data2.csv')

# Overwrite the delivered headers with the pre-defined names
data1.columns = ['A', 'B', 'C']
data2.columns = ['A', 'B', 'C']

pd.concat([data1, data2], axis=0)
