
Pandas usecols and append from multiple dataframes

I am not sure what the best way is to achieve this:

I have multiple xlsx files, and the customer_id column has a different name in each file. Suppose the following example:

xlsx1: customer_id
xlsx2: ID
xlsx3: client_ID
xlsx4: cus_id
xlsx5: consumer_number
xlsx6: customer_number
...etc

I want to read all the xlsx files in a folder, extract just the customer id column from each, and append them into one dataframe.

What I did so far:

I created a list of all the expected customer_id column names in the xlsx files:

customer_id = ["ID","customer_id","consumer_number","cus_id","client_ID"]

Then I read all the xlsx files in the folder:

import glob
import pandas as pd

all_data = pd.DataFrame()
for f in glob.glob("./*.xlsx"):
    df = pd.read_excel(f, usecols=customer_id)
    all_data = all_data.append(df, ignore_index=True)

Here I got the error:

ValueError: Usecols do not match columns, columns expected but not found:

I believe usecols expects every column name in the list to exist in each xlsx file, while I only need the single column in each file whose name matches one of the entries.
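
To illustrate with the example files above: for xlsx2 the list form of usecols effectively demands that all five names exist in that sheet, even though it only has ID. A hypothetical repro (the real file name will differ):

import pandas as pd

pd.read_excel("xlsx2.xlsx", usecols=["ID", "customer_id", "consumer_number", "cus_id", "client_ID"])
# raises the ValueError shown above, because the other four names are not present in this file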

One way is to read the full Excel file, then reindex with the possible ID column names in customer_id, which creates all-NaN columns for the names that are not present, and drop those with dropna. Rename the remaining column so the later concat lines up under a single name. Also, don't use pandas append in a loop; append each frame to a list and concat once at the end, which is faster. So you get:

frames = []  # collect the frames in a list and concat once at the end; faster than append in a loop
for f in glob.glob("./*.xlsx"):
    # reindex keeps only the candidate ID columns (missing names become all-NaN), then dropna removes them
    df = pd.read_excel(f).reindex(columns=customer_id).dropna(how='all', axis=1)
    df.columns = ["ID"]  # single remaining column, so every file concatenates under one name
    frames.append(df)
all_data = pd.concat(frames, ignore_index=True)  # combine all the files
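
An alternative, if you prefer not to read the unused columns at all, is to pass a callable to usecols: pandas evaluates it against each column name individually, so candidate names that are missing from a given file are simply not requested and no ValueError is raised. A minimal sketch, assuming each file contains exactly one of the candidate names:

import glob
import pandas as pd

customer_id = ["ID", "customer_id", "consumer_number", "cus_id", "client_ID"]

frames = []
for f in glob.glob("./*.xlsx"):
    # the callable is checked per column name, so missing candidates do not raise
    df = pd.read_excel(f, usecols=lambda c: c in customer_id)
    df.columns = ["ID"]  # assumes exactly one candidate column was found in this file
    frames.append(df)
all_data = pd.concat(frames, ignore_index=True)

As a side note, DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the list-plus-pd.concat pattern is also the forward-compatible way to build the combined frame.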
