[英]pandas read_excel multiple tables on the same sheet
Is it possible to read multiple tables from a sheet excel file using pandas ? 是否可以使用pandas从工作表excel文件中读取多个表? Something like: read table1 from row0 until row100 read table2 from row 102 until row202 ... 类似的东西:从row0读取table1直到row100从第102行读取table2直到row202 ...
Assuming we have the following Excel file: 假设我们有以下Excel文件:
Solution: we are parsing the first sheet (index: 0
) 解决方案:我们正在解析第一张表(索引: 0
)
xl = pd.ExcelFile(fn)
nrows = xl.book.sheet_by_index(0).nrows
df1 = xl.parse(0, skipfooter= nrows-(10+1)).dropna(axis=1, how='all')
df2 = xl.parse(0, skiprows=12).dropna(axis=1, how='all')
EDIT: skip_footer
was replaced with skipfooter
编辑: skip_footer
被替换skipfooter
Result: 结果:
In [123]: df1
Out[123]:
a b c
0 78 68 33
1 62 26 30
2 99 35 13
3 73 97 4
4 85 7 53
5 80 20 95
6 40 52 96
7 36 23 76
8 96 73 37
9 39 35 24
In [124]: df2
Out[124]:
c1 c2 c3 c4
0 78 88 59 a
1 82 4 64 a
2 35 9 78 b
3 0 11 23 b
4 61 53 29 b
5 51 36 72 c
6 59 36 45 c
7 7 64 8 c
8 1 83 46 d
9 30 47 84 d
First read in the entire csv
file: 首先阅读整个csv
文件:
import pandas as pd
df = pd.read_csv('path_to\\your_data.csv')
and then obtain the individual frames, for example using: 然后获取各个帧,例如使用:
df1 = df.iloc[:100,:]
df2 = df.iloc[100:200,:]
I wrote the following code to identify the multiple tables automatically, in case you have many files you need to process and don't want to look in each one to get the right row numbers. 我编写了以下代码来自动识别多个表,以防您需要处理多个文件,并且不希望查看每个表以获得正确的行号。 The code also looks for non-empty rows above each table and reads those as table metadata. 该代码还在每个表上方查找非空行,并将其作为表元数据读取。
def parse_excel_sheet(file, sheet_name=0, threshold=5):
'''parses multiple tables from an excel sheet into multiple data frame objects. Returns [dfs, df_mds], where dfs is a list of data frames and df_mds their potential associated metadata'''
xl = pd.ExcelFile(file)
entire_sheet = xl.parse(sheet_name=sheet_name)
# count the number of non-Nan cells in each row and then the change in that number between adjacent rows
n_values = np.logical_not(entire_sheet.isnull()).sum(axis=1)
n_values_deltas = n_values[1:] - n_values[:-1].values
# define the beginnings and ends of tables using delta in n_values
table_beginnings = n_values_deltas > threshold
table_beginnings = table_beginnings[table_beginnings].index
table_endings = n_values_deltas < -threshold
table_endings = table_endings[table_endings].index
if len(table_beginnings) < len(table_endings) or len(table_beginnings) > len(table_endings)+1:
raise BaseException('Could not detect equal number of beginnings and ends')
# look for metadata before the beginnings of tables
md_beginnings = []
for start in table_beginnings:
md_start = n_values.iloc[:start][n_values==0].index[-1] + 1
md_beginnings.append(md_start)
# make data frames
dfs = []
df_mds = []
for ind in range(len(table_beginnings)):
start = table_beginnings[ind]+1
if ind < len(table_endings):
stop = table_endings[ind]
else:
stop = entire_sheet.shape[0]
df = xl.parse(sheet_name=sheet_name, skiprows=start, nrows=stop-start)
dfs.append(df)
md = xl.parse(sheet_name=sheet_name, skiprows=md_beginnings[ind], nrows=start-md_beginnings[ind]-1).dropna(axis=1)
df_mds.append(md)
return dfs, df_mds
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.