I am not sure of the best way to achieve this:
I have multiple xlsx files, and the customer-ID column has a different name in each file. Suppose the following example:
xlsx1: customer_id
xlsx2: ID
xlsx3: client_ID
xlsx4: cus_id
xlsx5: consumer_number
xlsx6: customer_number
...etc
I want to read all the xlsx files in a folder, extract just the customer-ID column from each, and append them into one dataframe.
What I did so far:
I created a list of every expected customer-ID column name across the xlsx files:
customer_id = ["ID", "customer_id", "consumer_number", "cus_id", "client_ID"]
Then I read all the xlsx files in the folder:
import glob
import pandas as pd

all_data = pd.DataFrame()
for f in glob.glob("./*.xlsx"):
    df = pd.read_excel(f, usecols=customer_id)
    all_data = all_data.append(df, ignore_index=True)
Here I got the error:
ValueError: Usecols do not match columns, columns expected but not found:
I believe usecols expects every name in the list to be present in each xlsx file, whereas I need it to pick whichever single column in each file matches one of the names.
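As an aside, `usecols` also accepts a callable (assumption: a reasonably recent pandas, ≥ 0.24), which sidesteps the mismatch because the predicate is tested against each column name individually rather than requiring all names to exist. A minimal sketch:

```python
import glob
import pandas as pd

customer_id = ["ID", "customer_id", "consumer_number", "cus_id", "client_ID"]

def is_customer_col(name):
    # keep a column only if its header is one of the known aliases
    return name in customer_id

frames = []
for f in glob.glob("./*.xlsx"):
    df = pd.read_excel(f, usecols=is_customer_col)
    df.columns = ["ID"]  # normalise the header before concatenating
    frames.append(df)

all_data = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=["ID"])
```

This assumes each file contains exactly one of the candidate names; if a file could contain none or several, you would need to check `df.shape[1]` before renaming.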
One way is to read the full Excel file, then `reindex` with the possible ID column names in `customer_id`. That generates all-NaN columns for the names that are not present, which you can then drop with `dropna`. Rename the remaining column so the later `concat` lines up. Also, don't use pandas `append` in a loop; append to a list and `concat` later, it is faster. So you get:
l = []  # use a list and concat later, faster than append in the loop
for f in glob.glob("./*.xlsx"):
    df = pd.read_excel(f).reindex(columns=customer_id).dropna(how='all', axis=1)
    df.columns = ["ID"]  # to have only one column once concatenated
    l.append(df)
all_data = pd.concat(l, ignore_index=True)  # concat all data
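To see why the reindex/dropna trick works, here is the same normalisation applied to two in-memory frames (hypothetical stand-ins for the per-file `pd.read_excel` results):

```python
import pandas as pd

customer_id = ["ID", "customer_id", "consumer_number", "cus_id", "client_ID"]

# stand-ins for reading two files with differently named ID columns
df1 = pd.DataFrame({"cus_id": [1, 2], "name": ["a", "b"]})
df2 = pd.DataFrame({"client_ID": [3, 4], "city": ["x", "y"]})

frames = []
for df in (df1, df2):
    # reindex keeps only the candidate names; absent ones become all-NaN columns,
    # which dropna(how='all', axis=1) then removes
    sub = df.reindex(columns=customer_id).dropna(how="all", axis=1)
    sub.columns = ["ID"]
    frames.append(sub)

all_data = pd.concat(frames, ignore_index=True)
# all_data["ID"] is now [1, 2, 3, 4]
```

One caveat of this approach: `dropna(how='all')` would also discard a genuine ID column that happened to be entirely empty in some file.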