Reading three tables from the same file into different DataFrames in pandas
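I have an .xlsx file that contains 3 different tables, separated by the three keywords "Setteled", "Refund" and "Charged". I want to read each table into a separate dataframe. The file data and the desired output are shared below.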
Setteled
IN.Type STRA STRB STRC
CRBD 2487 XR XL0054
DFRS 3754 MY XL0684
CRBD 7356 DF XL8911
DFRS 4487 DF XL58999
DFRS 7785 MY XL76568
CRBD 8235 GL XL0635
DFRS 2468 PQ XL4569
DFRS 9735 GR XL7589
CRBD 6486 TY XL5566
DFRS 1023 PQ XL27952
Refund
IN.Type STRD STRE
DFRS 5898 RT
DFRS 5684 YU
CRBD 2564 RT
DFRS 1564 OP
DFRS 2548 YU
CRBD 4478 GL
CRBD 4515 OP
DFRS 5695 YU
DFRS 8665 RT
CRBD 1487 LK
Charged
IN.Type STRF STRG
CRBD 1289 GH
CRBD 8546 JK
CRBD 6599 LP
DFRS 7899 JK
DFRS 1456 GH
CRBD 6988 JK
DFRS 1468 LP
DFRS 4697 GH
DFRS 7941 LP
DFRS 1636 JK
After reading the file, I want the three tables in separate dataframes, as follows.
df = "rows available under Setteled"
IN.Type STRA STRB STRC
CRBD 2487 XR XL0054
DFRS 3754 MY XL0684
CRBD 7356 DF XL8911
DFRS 4487 DF XL58999
DFRS 7785 MY XL76568
CRBD 8235 GL XL0635
DFRS 2468 PQ XL4569
DFRS 9735 GR XL7589
CRBD 6486 TY XL5566
DFRS 1023 PQ XL27952
df2 = "rows available under Refund"
IN.Type STRD STRE
DFRS 5898 RT
DFRS 5684 YU
CRBD 2564 RT
DFRS 1564 OP
DFRS 2548 YU
CRBD 4478 GL
CRBD 4515 OP
DFRS 5695 YU
DFRS 8665 RT
CRBD 1487 LK
df3 = "rows available under Charged"
IN.Type STRF STRG
CRBD 1289 GH
CRBD 8546 JK
CRBD 6599 LP
DFRS 7899 JK
DFRS 1456 GH
CRBD 6988 JK
DFRS 1468 LP
DFRS 4697 GH
DFRS 7941 LP
DFRS 1636 JK
Are your "tables" actual Excel tables? If so, you can read them through openpyxl's ws.tables.
For example:
import pandas as pd
from openpyxl import load_workbook

filename = "tables.xlsx"
#read file
wb = load_workbook(filename)
#access specific sheet
ws = wb["Sheet1"]
mapping = {}
for entry, data_boundary in ws.tables.items():
    #parse the data within the ref boundary
    data = ws[data_boundary]
    #extract the data
    #the inner list comprehension gets the values for each cell in the table
    content = [[cell.value for cell in ent] for ent in data]
    header = content[0]
    #the contents ... excluding the header
    rest = content[1:]
    #create dataframe with the column names
    #and pair table name with dataframe
    df = pd.DataFrame(rest, columns=header)
    mapping[entry] = df
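This gives you a dictionary containing all the tables in the given worksheet.
If your "tables" are not actual Excel tables but just ranges, we have to define the ranges ourselves. The code below should work, provided that all your "tables" are in the same worksheet, all keywords are in row 1, and the actual "tables" start in row 2. It does not matter in which column the first table starts, or whether the tables are separated by empty columns.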
import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils import get_column_letter

filename = "data_tables.xlsx"
#read file
wb = load_workbook(filename)
#access specific sheet
ws = wb["Sheet1"]
#create dict to store df "tables"
mapping = {}
#get cols for key words
col_numbers = [idx+1 for idx, cell in enumerate(ws[1]) if cell.value is not None]
#set vars to empty strings
first_address = ''
last_address = ''
entry = ''
for col in range(1, ws.max_column + 1):
    #convert int to col letter
    col_letter = get_column_letter(col)
    #default: 0 means there is no value in this col
    last_row = 0
    #find last cell in col with value with loop over reversed col entries
    for cell in ws[col_letter][::-1]:
        if cell.value is not None:
            last_row = cell.row
            break
    #if col in col_numbers this is where a new "table" starts
    if col in col_numbers:
        #set entry for dict key
        entry = ws.cell(1, col).value
        #get first and last address
        first_address = f'{col_letter}{2}'
        last_address = f'{col_letter}{last_row}'
    #if col is not empty and last_address is not empty string, then we are
    #still inside one of our "tables", so update last_address
    if last_row != 0 and last_address != '':
        last_address = f'{col_letter}{last_row}'
    #create entry if
    # (we are in empty col | the next col starts a new "table" | we're in the last col)
    # AND we haven't yet created this table (e.g. tables separated by multiple empty cols)
    # AND first_address is not empty string (i.e. we have reached the first table)
    if ((last_row == 0 or col+1 in col_numbers or col == ws.max_column)
            and entry not in mapping
            and first_address != ''):
        #create string with table range
        table_range = f'{first_address}:{last_address}'
        #extract the data
        #the inner list comprehension gets the values for each cell in the table
        data = ws[table_range]
        content = [[cell.value for cell in ent] for ent in data]
        #first row is the header, the rest is the contents
        header = content[0]
        rest = content[1:]
        #create dataframe with the column names
        #and pair table name with dataframe
        df = pd.DataFrame(rest, columns=header)
        mapping[entry] = df
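I have tested this code on a worksheet with data laid out this way, and it works as expected. If your keywords contain duplicates, the current code only creates a df for the first occurrence of a keyword. If you want the code to handle duplicates, add a check after entry = ws.cell(1,col).value to see whether entry is already used as a key in the dict. If it is, assign a different value to entry and continue. Let me know if you run into any difficulties.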
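The duplicate check described above could be sketched as a small helper. The name unique_key and the "_2", "_3" suffix scheme are assumptions made for illustration; they are not part of the original answer.

```python
def unique_key(entry, mapping):
    """Return entry unchanged if it is unused, otherwise entry plus a numeric suffix."""
    if entry not in mapping:
        return entry
    n = 2
    # Count upward until an unused name such as "Refund_2" is found.
    while f"{entry}_{n}" in mapping:
        n += 1
    return f"{entry}_{n}"


# Hypothetical state of the dict after one "Refund" table has been stored.
mapping = {"Refund": "first table"}
print(unique_key("Charged", mapping))  # unused keyword, returned unchanged
print(unique_key("Refund", mapping))   # duplicate keyword, gets a suffix
```

In the loop, the line that sets the key would then read entry = unique_key(ws.cell(1, col).value, mapping), so a repeated keyword gets its own entry instead of being skipped.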
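I'm not sure this is the best approach, but you can use
pd.read_excel(file, skiprows=1, skipfooter=#)
So for the first dataframe, you skip one row at the top and skip # rows at the bottom, where # is the number of rows below the last row of data you want.
You can also read the whole sheet into one dataframe and then slice it with df.loc.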
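A minimal sketch of the df.loc slicing idea, assuming the layout from the question: keywords in the first column, each followed by a header row, with no blank rows in between. The frame raw stands in for pd.read_excel(file, header=None), and the keywords list and the tables dict are names introduced here for illustration.

```python
import pandas as pd

# Stand-in for pd.read_excel(file, header=None): a shortened copy of the question's sheet.
raw = pd.DataFrame([
    ["Setteled", None, None, None],
    ["IN.Type", "STRA", "STRB", "STRC"],
    ["CRBD", 2487, "XR", "XL0054"],
    ["DFRS", 3754, "MY", "XL0684"],
    ["Refund", None, None, None],
    ["IN.Type", "STRD", "STRE", None],
    ["DFRS", 5898, "RT", None],
    ["Charged", None, None, None],
    ["IN.Type", "STRF", "STRG", None],
    ["CRBD", 1289, "GH", None],
])

keywords = ["Setteled", "Refund", "Charged"]
# Row labels where a keyword sits in the first column.
starts = raw[raw[0].isin(keywords)].index.tolist()
# Each table ends where the next keyword row begins (or at the end of the sheet).
ends = starts[1:] + [len(raw)]

tables = {}
for start, end in zip(starts, ends):
    key = raw.loc[start, 0]
    # .loc slicing is label-based and inclusive, hence end - 1.
    block = raw.loc[start + 1:end - 1].dropna(axis=1, how="all")
    header = block.iloc[0].tolist()
    body = block.iloc[1:].reset_index(drop=True)
    body.columns = header
    tables[key] = body

print(tables["Refund"])
```

Looking the boundaries up by keyword avoids hard-coding skiprows and skipfooter, so the same code keeps working if the tables grow or shrink.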
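With the update showing how the data looks in the worksheet, I think a different approach is easier. For that reason I'm adding a new answer. In this case we can simply use pandas and numpy: read the file into 1 df, then split it into 3 dfs. (Perhaps this would also be possible in the other scenario, but that's another matter.)
This should do it: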
import pandas as pd
import numpy as np

filename = "data_tables.xlsx"
# read excel file
# note: the default header=0 turns the first row read into column labels;
# pass header=None if your first keyword sits in that row
df = pd.read_excel(filename, sheet_name='Sheet1')
# drop all cols with only NaN
df = df.dropna(axis=1, how="all")
# split dfs on rows with only NaN
df_list = np.split(df, df[df.isnull().all(1)].index)
# dictionary to store dfs
mapping = {}
# loop through list of dfs
for df in df_list:
    # drop all rows and cols with only NaN
    df = df.dropna(how="all")
    df = df.dropna(axis=1, how="all")
    # first cell should now contain your key word
    key = df.iloc[0,0]
    # second row should now contain your headers
    df.columns = list(df.iloc[1])
    # content starts at third row
    df = df[2:]
    # reset the index
    df.reset_index(drop=True, inplace=True)
    # add to dictionary
    mapping[key] = df
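Calling np.split on a DataFrame may raise deprecation warnings on recent pandas versions. As a hedged alternative, the same split on blank rows can be sketched with pandas alone, using a running count of blank rows as the group label. The frame raw is a shortened stand-in for the sheet read with header=None, and the names is_blank and block_id are introduced here for illustration.

```python
import pandas as pd

# Stand-in for the sheet read with header=None: a blank row separates the tables.
raw = pd.DataFrame([
    ["Setteled", None, None],
    ["IN.Type", "STRA", "STRB"],
    ["CRBD", 2487, "XR"],
    [None, None, None],
    ["Refund", None, None],
    ["IN.Type", "STRD", "STRE"],
    ["DFRS", 5898, "RT"],
])

# A blank row marks a boundary; the running count of blank rows labels each block.
is_blank = raw.isna().all(axis=1)
block_id = is_blank.cumsum()

mapping = {}
# Group only the non-blank rows, so the separator rows never enter a block.
for _, block in raw[~is_blank].groupby(block_id[~is_blank]):
    block = block.dropna(axis=1, how="all")
    # first cell holds the keyword, second row the headers, the rest is content
    key = block.iloc[0, 0]
    table = block.iloc[2:].reset_index(drop=True)
    table.columns = block.iloc[1].tolist()
    mapping[key] = table

print(mapping["Refund"])
```

Either way, the result is a dict keyed by keyword, so the three dataframes from the question are available as mapping["Setteled"], mapping["Refund"] and mapping["Charged"] when run on the full sheet.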