
Read all Excel files in a folder, split each file name, and add the split parts to the DataFrame

All files follow a naming convention such as NPS_Platform_FirstLabel_Session_Language_Version.xlsx. I want to add columns named Platform, FirstLabel, Session, Language, and Version, with the values taken from each file's name. I wrote the code below. It runs, but the values in the added columns all come from the last file. For example, if the last file name is NPS_MEM_GAIT_Science_EN_10.xlsx, every row in the added columns gets MEM, GAIT_Science, etc., rather than the values from its own file name.

import glob
import os
import pandas as pd

path = "C:/Users/User/blabla"
all_files = glob.glob(os.path.join(path, "*.xlsx")) #make list of paths

df = pd.DataFrame()

for f in all_files:
    data = pd.read_excel(f)
    df = df.append(data)
    file_name = os.path.splitext(os.path.basename(f))[0]
    nameList = []
    nameList = file_name.rsplit('_')  
    df['Platform'] = nameList[1]
    df['First label']= nameList[2]
    df['Session'] = nameList[3]
    df['Language'] = nameList[4]
    df['Version'] = nameList[5]
df

I started with nameList[1] since I don't want NPS. Any suggestions or feedback?
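One suggestion is to pull the file name handling into a small helper so it can be checked on its own, without reading any Excel files. The following is only a sketch: the names `parse_nps_filename` and `FIELDS` are hypothetical, and it assumes that no individual field contains an underscore.

```python
import os

# Column names for the parts that follow the leading "NPS" prefix.
FIELDS = ("Platform", "First label", "Session", "Language", "Version")


def parse_nps_filename(path):
    """Map a path like .../NPS_MEM_GAIT_Science_EN_10.xlsx to column values."""
    stem = os.path.splitext(os.path.basename(path))[0]
    parts = stem.split("_")
    # Expect the "NPS" prefix plus exactly one part per field.
    if len(parts) != len(FIELDS) + 1:
        raise ValueError(f"unexpected file name format: {stem!r}")
    return dict(zip(FIELDS, parts[1:]))


print(parse_nps_filename("C:/Users/User/blabla/NPS_MEM_GAIT_Science_EN_10.xlsx"))
```

Raising an error on an unexpected number of parts makes a misnamed file visible immediately instead of silently shifting values into the wrong columns.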

I have found a solution. I am leaving it here since this question has had more views than I expected.

import glob
import os
import pandas as pd


path = "C:/Users/User/....."
all_files = glob.glob(os.path.join(path, "*.xlsx")) #make list of paths

# Read every file into its own DataFrame, in the same order as all_files
df_files = [pd.read_excel(filename) for filename in all_files]

# Label each DataFrame with the parts of its own file name
for dataframe, filename in zip(df_files, all_files):
    parts = os.path.splitext(os.path.basename(filename))[0].split('_')
    dataframe['Platform'] = parts[1]  # parts[0] is the "NPS" prefix
    dataframe['First label'] = parts[2]
    dataframe['Session'] = parts[3]
    dataframe['Language'] = parts[4]
    dataframe['Version'] = parts[5]

# Combine the labelled DataFrames into one
df = pd.concat(df_files, ignore_index=True)

I think the reason is that I was only iterating over the files, not over the DataFrame I was trying to build. With this approach, I can iterate over the DataFrames and the file names at the same time. I found this solution at https://jonathansoma.com/lede/foundations-2017/classes/working-with-many-files/class/. Still, if someone can give an explicit explanation of why the first code does not work as I intended, that would be great.
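As for why the first version misbehaves, a likely explanation is that `df['Platform'] = nameList[1]` assigns a single scalar to every row of the accumulated `df`, including the rows that came from earlier files. Each pass through the loop therefore overwrites the previous labels, and only the last file's values survive. The sketch below reproduces this with small in-memory frames instead of Excel files. The `frames` dict and its contents are made up for illustration, and `pd.concat` stands in for `df.append`, which was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical stand-ins for pd.read_excel(f): file stem -> file contents
frames = {
    "NPS_WEB_Alpha_S1_TR_3": pd.DataFrame({"score": [1, 2]}),
    "NPS_MEM_GAIT_Science_EN_10": pd.DataFrame({"score": [3]}),
}

# Original pattern: grow one frame, then label the whole frame each time.
buggy = None
for stem, data in frames.items():
    if buggy is None:
        buggy = data.copy()
    else:
        buggy = pd.concat([buggy, data], ignore_index=True)
    # The scalar is broadcast to ALL rows, old and new alike.
    buggy["Platform"] = stem.split("_")[1]

# Alternative: label each file's own frame first, then combine once.
labelled = []
for stem, data in frames.items():
    parts = stem.split("_")
    labelled.append(data.assign(Platform=parts[1], Language=parts[4]))
fixed = pd.concat(labelled, ignore_index=True)

print(buggy["Platform"].tolist())  # every row carries the last file's value
print(fixed["Platform"].tolist())  # each row keeps its own file's value
```

Labelling `data` before it is combined is the key difference. The `zip` version in the solution works for the same reason: each assignment touches only one file's rows.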
