I have to read an Excel file in pandas which contains multiple sheets. Unfortunately, the number of blank rows before the header starts differs between sheets:
pd.read_excel('foo.xlsx', header=[2,3], sheet_name='first')
pd.read_excel('foo.xlsx', header=[1,2], sheet_name='second')
Is there an elegant way to fix this and read the Excel file into a pandas.DataFrame with an additional column which contains the name of each sheet?
I.e., how can
pd.read_excel(file_name, sheet_name=None)
be passed a varying header argument, or at least be made to choose the first 2 non-empty rows as the header?
The question "Dynamically skip top blank rows of excel in python pandas" seems to be related, but it is not a solution here because only the first header row is picked up.
Description of exact file structure:
... (varying number of empty rows)
__irrelevant_row__
HEADER_1
HEADER_2
Currently there are either 0 or 1 empty rows, but as pointed out in the comments, it would be great if the solution were more dynamic.
I am certain this could be done more neatly, but one way to achieve what (I think) you want is:
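import openpyxl
import pandas as pd

book = openpyxl.load_workbook(PATH_TO_FILE)
frames = []
for sh in book.sheetnames:
    # Load the raw cell values and drop rows that are completely empty
    a = pd.DataFrame(book[sh].values).dropna(how='all').reset_index(drop=True)
    # Row 0 is the irrelevant row; row 1 is used as the header
    a.columns = a.iloc[1].tolist()
    a = a.iloc[2:].copy()
    # Record which sheet the rows came from
    a["sheet"] = sh
    frames.append(a)
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
b = pd.concat(frames)
print(b)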
# header1 header2 sheet
#2 1 2 first
#3 3 4 first
#2 1 2 second
#3 3 4 second
#2 1 2 3rd
#3 3 4 3rd
Unfortunately I have no clue how it interacts with your actual data, but I do hope this helps you in your quest.
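If both header rows from the question need to be kept, a pandas-only variant might look like the sketch below. It assumes the layout described in the question (any number of blank rows, one irrelevant row, then exactly two header rows), and that no data row is completely empty. `frame_from_raw` and `read_all_sheets` are hypothetical helper names, not part of pandas.

```python
import pandas as pd


def frame_from_raw(raw, sheet_name, n_skip=1):
    """Turn a header-less dump of one sheet into a frame with two header levels.

    Assumed layout: blank rows, `n_skip` irrelevant rows, two header rows, data.
    """
    # Dropping fully empty rows makes the number of leading blank rows irrelevant
    rows = raw.dropna(how="all").reset_index(drop=True)
    header = rows.iloc[n_skip:n_skip + 2]
    # Everything after the header is data; let pandas re-infer the column dtypes
    data = rows.iloc[n_skip + 2:].reset_index(drop=True).infer_objects()
    # Combine the two header rows into a column MultiIndex
    data.columns = pd.MultiIndex.from_arrays(header.to_numpy().tolist())
    # The key is a tuple because the columns now have two levels
    data[("sheet", "")] = sheet_name
    return data


def read_all_sheets(path):
    # sheet_name=None returns {sheet name: DataFrame}; header=None keeps every row as data
    raw_sheets = pd.read_excel(path, sheet_name=None, header=None)
    frames = [frame_from_raw(raw, name) for name, raw in raw_sheets.items()]
    return pd.concat(frames, ignore_index=True)
```

Calling `read_all_sheets('foo.xlsx')` should then give one frame with a `sheet` column, provided all sheets share the same two-level header. Merged or empty header cells would show up as NaN in the column index and would need extra handling.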