I have an .xlsx file that contains three different tables, separated by the three keywords "Setteled", "Refund" and "Charged". I want to read each table into a separate dataframe. The file data and the desired output are shown below.
Setteled
IN.Type STRA STRB STRC
CRBD 2487 XR XL0054
DFRS 3754 MY XL0684
CRBD 7356 DF XL8911
DFRS 4487 DF XL58999
DFRS 7785 MY XL76568
CRBD 8235 GL XL0635
DFRS 2468 PQ XL4569
DFRS 9735 GR XL7589
CRBD 6486 TY XL5566
DFRS 1023 PQ XL27952
Refund
IN.Type STRD STRE
DFRS 5898 RT
DFRS 5684 YU
CRBD 2564 RT
DFRS 1564 OP
DFRS 2548 YU
CRBD 4478 GL
CRBD 4515 OP
DFRS 5695 YU
DFRS 8665 RT
CRBD 1487 LK
Charged
IN.Type STRF STRG
CRBD 1289 GH
CRBD 8546 JK
CRBD 6599 LP
DFRS 7899 JK
DFRS 1456 GH
CRBD 6988 JK
DFRS 1468 LP
DFRS 4697 GH
DFRS 7941 LP
DFRS 1636 JK
After reading the file, I want the three tables above in different dataframes, as shown below.
df = "Rows available below Setteled"
IN.Type STRA STRB STRC
CRBD 2487 XR XL0054
DFRS 3754 MY XL0684
CRBD 7356 DF XL8911
DFRS 4487 DF XL58999
DFRS 7785 MY XL76568
CRBD 8235 GL XL0635
DFRS 2468 PQ XL4569
DFRS 9735 GR XL7589
CRBD 6486 TY XL5566
DFRS 1023 PQ XL27952
df2 = "Rows available below Refund"
IN.Type STRD STRE
DFRS 5898 RT
DFRS 5684 YU
CRBD 2564 RT
DFRS 1564 OP
DFRS 2548 YU
CRBD 4478 GL
CRBD 4515 OP
DFRS 5695 YU
DFRS 8665 RT
CRBD 1487 LK
df3 = "Rows available below Charged"
IN.Type STRF STRG
CRBD 1289 GH
CRBD 8546 JK
CRBD 6599 LP
DFRS 7899 JK
DFRS 1456 GH
CRBD 6988 JK
DFRS 1468 LP
DFRS 4697 GH
DFRS 7941 LP
DFRS 1636 JK
Are your "tables" actual Excel tables? If so, you could use the approach explained here.
E.g.:
import pandas as pd
from openpyxl import load_workbook
filename = "tables.xlsx"
#read file
wb = load_workbook(filename)
#access specific sheet
ws = wb["Sheet1"]
mapping = {}
for entry, data_boundary in ws.tables.items():
    #parse the data within the ref boundary
    data = ws[data_boundary]
    #extract the data
    #the inner list comprehension gets the values for each cell in the table
    content = [[cell.value for cell in ent]
               for ent in data
               ]
    header = content[0]
    #the contents ... excluding the header
    rest = content[1:]
    #create dataframe with the column names
    #and pair table name with dataframe
    df = pd.DataFrame(rest, columns = header)
    mapping[entry] = df
This will get you a dictionary with all the tables in a specific sheet.
If your "tables" aren't actual Excel tables, but simply ranges, we'll have to define the ranges ourselves. The code below should work, provided that all your "tables" are in the same worksheet, all keywords are in row 1, and the actual "tables" start in row 2. It should not matter in which column the first table starts, or whether the tables are separated by empty columns.
import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils import get_column_letter
filename = "data_tables.xlsx"
#read file
wb = load_workbook(filename)
#access specific sheet
ws = wb["Sheet1"]
#create dict to store df "tables"
mapping = {}
#get cols for key words (1-based column numbers of non-empty cells in row 1)
col_numbers = [idx+1 for idx, cell in enumerate(ws[1]) if cell.value is not None]
#set vars to empty strings
first_address = ''
last_address = ''
entry = ''
for col in range(1, ws.max_column + 1):
    #convert int to col letter
    col_letter = get_column_letter(col)
    #last_row stays 0 if there is no value in col
    last_row = 0
    #find last cell in col with value with loop over reversed col entries
    for cell in ws[col_letter][::-1]:
        if cell.value is not None:
            last_row = cell.row
            break
    #if col in col_numbers this is where a new "table" starts
    if col in col_numbers:
        #set entry for dict key
        entry = ws.cell(1,col).value
        #get first and last address
        first_address = f'{col_letter}{2}'
        last_address = f'{col_letter}{last_row}'
    #if col is not empty and last_address is not empty string, then we are
    #still inside one of our "tables", so update last_address
    if last_row != 0 and last_address != '':
        last_address = f'{col_letter}{last_row}'
    #create entry if
    # (we are in empty col | the next col starts a new "table" | we're in the last col)
    # AND we haven't yet created this table (e.g. tables separated by multiple empty cols)
    # AND first_address is not empty string (i.e. we have already reached the first table)
    if (last_row == 0 or col+1 in col_numbers or col == ws.max_column) \
            and entry not in mapping.keys() and first_address != '':
        #create string with table range
        table_range = f'{first_address}:{last_address}'
        #extract the data
        #the inner list comprehension gets the values for each cell in the table
        data = ws[table_range]
        content = [[cell.value for cell in ent] for ent in data]
        #the contents ... excluding the header
        header = content[0]
        rest = content[1:]
        #create dataframe with the column names
        #and pair table name with dataframe
        df = pd.DataFrame(rest, columns = header)
        mapping[entry] = df
I've tested this code on a worksheet laid out as described above (keywords in row 1, tables starting in row 2), and it works as expected. If your keywords contain duplicates, the present code will only create a df for the first occurrence of a keyword. If you'd like the code to handle duplicates, you need to add a check after entry = ws.cell(1,col).value to see whether entry is already used as a key in the dict. If so, assign a different value to entry and continue. Let me know if you experience any difficulties.
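The duplicate-keyword check mentioned above could look roughly like the sketch below. The helper name unique_key and the "_2", "_3" suffix scheme are my own assumptions, not part of the answer's code; any scheme that yields an unused key would do. The strings stored in the dict are placeholders for the dataframes.

```python
def unique_key(entry, mapping):
    """Return entry unchanged if it is unused, else entry plus a numeric suffix.

    Hypothetical helper: the suffix scheme (_2, _3, ...) is an assumption.
    """
    if entry not in mapping:
        return entry
    n = 2
    # keep counting until the suffixed name is free
    while f"{entry}_{n}" in mapping:
        n += 1
    return f"{entry}_{n}"


mapping = {}
for keyword in ["Refund", "Charged", "Refund", "Refund"]:
    key = unique_key(keyword, mapping)
    mapping[key] = f"table for {keyword}"  # placeholder for the DataFrame

print(list(mapping))
```

In the loop from the answer, this would presumably be used as entry = unique_key(ws.cell(1,col).value, mapping), so that a repeated keyword gets its own dict key instead of being skipped.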
I am not sure if this is the best approach, but you can use
pd.read_excel(file, skiprows=1, skipfooter=#)
For the first dataframe, skiprows=1 skips the keyword line at the top, and skipfooter has to be set to the number of lines below the last line of that table's data (the # above is a placeholder for that count).
You can also read the whole sheet as one dataframe and then slice it using df.loc.
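As a rough illustration of the "read everything, then slice" idea, here is a minimal sketch. It makes several assumptions: the sheet was read without a header row (for example with header=None), each keyword sits in the first column, and the row right after a keyword holds that table's column names. The function name split_on_keywords is made up for this example, the small in-memory frame stands in for the real file, and positional slicing with iloc is used instead of df.loc to avoid off-by-one issues with label-based, end-inclusive slices.

```python
import pandas as pd


def split_on_keywords(raw, keywords):
    """Split a header-less frame into one frame per keyword row (sketch)."""
    first_col = raw.iloc[:, 0]
    # positions of the rows whose first cell is one of the keywords
    starts = [i for i, value in enumerate(first_col) if value in keywords]
    bounds = starts + [len(raw)]
    tables = {}
    for start, end in zip(bounds[:-1], bounds[1:]):
        # take the rows of one table and drop fully empty rows and columns
        block = raw.iloc[start:end].dropna(how="all").dropna(axis=1, how="all")
        # row 0 is the keyword, row 1 the header, the rest is data
        body = block.iloc[2:].reset_index(drop=True)
        body.columns = block.iloc[1].tolist()
        tables[block.iloc[0, 0]] = body
    return tables


# stand-in for pd.read_excel(file, header=None), shortened sample data
raw = pd.DataFrame([
    ["Setteled", None, None, None],
    ["IN.Type", "STRA", "STRB", "STRC"],
    ["CRBD", 2487, "XR", "XL0054"],
    ["DFRS", 3754, "MY", "XL0684"],
    ["Refund", None, None, None],
    ["IN.Type", "STRD", "STRE", None],
    ["DFRS", 5898, "RT", None],
])
tables = split_on_keywords(raw, {"Setteled", "Refund", "Charged"})
df = tables["Setteled"]
df2 = tables["Refund"]
```

Because the split is driven by the keyword rows, this does not depend on knowing how many rows each table has.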
With the update of how the data looks in the worksheet, I think a different approach is easier. For that reason, I'm adding a new answer. In this case, we can simply use pandas and numpy: read the file into 1 df and then split it into 3 dfs. (Maybe with the other scenarios this is also possible, but this is another matter.)
This should do it:
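import pandas as pd
import numpy as np
filename = "data_tables.xlsx"
# read excel file without a header row, so the first key word stays in the data
df = pd.read_excel(filename, sheet_name='Sheet1', header=None)
# drop all cols with only NaN
df = df.dropna(axis=1, how="all")
# positions of rows with only NaN (these separate the tables)
blank_rows = np.flatnonzero(df.isnull().all(axis=1))
# split dfs on those rows; plain slicing is used here because
# np.split on a DataFrame is deprecated in recent pandas versions
bounds = [0, *blank_rows, len(df)]
df_list = [df.iloc[start:end] for start, end in zip(bounds[:-1], bounds[1:])]
# dictionary to store dfs
mapping = {}
# loop through list of dfs
for chunk in df_list:
    # drop all rows and cols with only NaN
    chunk = chunk.dropna(how="all")
    chunk = chunk.dropna(axis=1, how="all")
    # skip chunks that are empty (e.g. several blank rows in a row)
    if chunk.empty:
        continue
    # first cell should now contain your key word
    key = chunk.iloc[0,0]
    # second row should now contain your headers
    chunk.columns = list(chunk.iloc[1])
    # content starts at third row
    chunk = chunk[2:]
    # reset the index
    chunk = chunk.reset_index(drop=True)
    # add to dictionary
    mapping[key] = chunk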