简体   繁体   English

pandas read_excel在同一张纸上的多个表

[英]pandas read_excel multiple tables on the same sheet

Is it possible to read multiple tables from a sheet excel file using pandas ? 是否可以使用pandas从工作表excel文件中读取多个表? Something like: read table1 from row0 until row100 read table2 from row 102 until row202 ... 类似的东西:从row0读取table1直到row100从第102行读取table2直到row202 ...

Assuming we have the following Excel file: 假设我们有以下Excel文件:

在此输入图像描述

Solution: we are parsing the first sheet (index: 0 ) 解决方案:我们正在解析第一张表(索引: 0

xl = pd.ExcelFile(fn)
nrows = xl.book.sheet_by_index(0).nrows

df1 = xl.parse(0, skipfooter= nrows-(10+1)).dropna(axis=1, how='all')
df2 = xl.parse(0, skiprows=12).dropna(axis=1, how='all')

EDIT: skip_footer was replaced with skipfooter 编辑: skip_footer被替换skipfooter

Result: 结果:

In [123]: df1
Out[123]:
    a   b   c
0  78  68  33
1  62  26  30
2  99  35  13
3  73  97   4
4  85   7  53
5  80  20  95
6  40  52  96
7  36  23  76
8  96  73  37
9  39  35  24

In [124]: df2
Out[124]:
   c1  c2  c3 c4
0  78  88  59  a
1  82   4  64  a
2  35   9  78  b
3   0  11  23  b
4  61  53  29  b
5  51  36  72  c
6  59  36  45  c
7   7  64   8  c
8   1  83  46  d
9  30  47  84  d

First read in the entire csv file: 首先阅读整个csv文件:

import pandas as pd
df = pd.read_csv('path_to\\your_data.csv')

and then obtain the individual frames, for example using: 然后获取各个帧,例如使用:

df1 = df.iloc[:100,:]
df2 = df.iloc[100:200,:]

I wrote the following code to identify the multiple tables automatically, in case you have many files you need to process and don't want to look in each one to get the right row numbers. 我编写了以下代码来自动识别多个表,以防您需要处理多个文件,并且不希望查看每个表以获得正确的行号。 The code also looks for non-empty rows above each table and reads those as table metadata. 该代码还在每个表上方查找非空行,并将其作为表元数据读取。

def parse_excel_sheet(file, sheet_name=0, threshold=5):
    '''parses multiple tables from an excel sheet into multiple data frame objects. Returns [dfs, df_mds], where dfs is a list of data frames and df_mds their potential associated metadata'''
    xl = pd.ExcelFile(file)
    entire_sheet = xl.parse(sheet_name=sheet_name)

    # count the number of non-Nan cells in each row and then the change in that number between adjacent rows
    n_values = np.logical_not(entire_sheet.isnull()).sum(axis=1)
    n_values_deltas = n_values[1:] - n_values[:-1].values

    # define the beginnings and ends of tables using delta in n_values
    table_beginnings = n_values_deltas > threshold
    table_beginnings = table_beginnings[table_beginnings].index
    table_endings = n_values_deltas < -threshold
    table_endings = table_endings[table_endings].index
    if len(table_beginnings) < len(table_endings) or len(table_beginnings) > len(table_endings)+1:
        raise BaseException('Could not detect equal number of beginnings and ends')

    # look for metadata before the beginnings of tables
    md_beginnings = []
    for start in table_beginnings:
        md_start = n_values.iloc[:start][n_values==0].index[-1] + 1
        md_beginnings.append(md_start)

    # make data frames
    dfs = []
    df_mds = []
    for ind in range(len(table_beginnings)):
        start = table_beginnings[ind]+1
        if ind < len(table_endings):
            stop = table_endings[ind]
        else:
            stop = entire_sheet.shape[0]
        df = xl.parse(sheet_name=sheet_name, skiprows=start, nrows=stop-start)
        dfs.append(df)

        md = xl.parse(sheet_name=sheet_name, skiprows=md_beginnings[ind], nrows=start-md_beginnings[ind]-1).dropna(axis=1)
        df_mds.append(md)
    return dfs, df_mds

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM