简体   繁体   中英

how to stack specific sheets from multiple excel workbooks using python

I am trying to stack multiple workbooks by sheet in python. Currently my folder contains over 50 individual workbooks which are separated by date and usually contain up to 3 sheets, although not always. The sheets include "Pay", "Tablet", "SIMO".

I want to try to stack all data from the "Tablet" sheet into a new workbook and have been using the following code.

**import os
import pandas as pd
path = r"path_location"
files = os.listdir(path)
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel(file, sheet_name='Tablet'), ignore_index=True) 
df.head() 
df.to_csv('Tablet_total.csv')**

however after checking the files I realised this is not pulling from all workbooks that have the sheet "Tablet". I suspect this may be due to the fact that not all workbooks have this sheet but in case I missed anything I'd greatly appreciate some ideas to what I may be doing wrong.

Also, as a final request, the sheet "Tablet" across all workbooks has unnecessary columns in the beginning. I have tried incorporating df.drop(index=df.index[:7], axis=0, inplace=True) into my loop yet this only removes the first 7 rows from the first iteration. Again any support with this would be greatly appreciated.

Many thanks

First, I would check that you do not have any .xls files or other excel file suffixes with:

import os
path = r"path_location"
files = os.listdir(path)
print({
    file.split('.')[1]
    for file in files
})

Then, I would check that you don't have any sheets with trailing white spaces or capitalization issues with:

import os
import pandas
path = r"path_location"
files = os.listdir(path)
print({
    sheet_name
    for file in files
    if file.endswith('.xlsx') 
    for sheet_name in pandas.ExcelFile(file).sheet_names
})

I would use pandas.concat() with a list comprehension to concatenate the sheets. I would also add a check to ensure that the worksheet as a sheet named 'Tablet'. Finally, if want to not use the first seven columns you should a) do it on each dataframe as it's read in and before it is concatenated together with the other dataframes, and b) first include all the rows and then specify the columns with .iloc[:, 7:]

import os
import pandas
path = r"path_location"
files = os.listdir(path)
df = pandas.concat([
    pandas.read_excel(file, sheet_name='Tablet').iloc[:, 7:]
    for file in files
    if file.endswith('.xlsx') and 'Tablet' in pandas.ExcelFile(file).sheet_names 
])
df.head()
  1. Check if you have excel files with other extensions - xlsm, xlsb and others?
  2. In order to remove seven rows at each iteration, you need to select a temporary dataframe when reading from excel and delete them from it.
    df_tmp = pd.read_excel(file, sheet_name='Tablet')
    df_tmp.drop(index=df_tmp.index[:7], axis=0, inplace=True)
  1. Since append is deprecated, use concat() instead. pandas.DataFrame.append

    pd.concat([df, df_tmp], ignore_index=True)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM