Extracting multiple tables from a spreadsheet using Python

Question

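I want to extract multiple tables from a series of Excel spreadsheets, some of which may contain more than one table per sheet, and store each table separately as a CSV file. A sheet might look like this: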

If I read it with pandas read_excel

import pandas as pd
pd.read_excel('table_example.xlsx',header=None)

I get something like this:

How can I extract the different tables? In my case the tables contain NaN values, which may be an additional complication.

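[EDIT1] Something similar to the Excel sheet can be generated with pandas:

import numpy as np

# dtype=object lets text headers and numeric cells share a column
df=pd.DataFrame(np.nan,index=range(0,10),columns=range(0,10),dtype=object)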
df.iloc[1,2:5]=['t1h1','t1h2','t1h3']
df.iloc[2:5,2:5]=np.random.randn(3,3)
df.iloc[6,3:7]=['t2h1','t2h2','t2h3','t2h4']
df.iloc[7:9,3:7]=np.random.randn(2,4)

I tried to find the boundaries of the tables using built-in pandas functions:

df[df.isnull().all(axis=1)]

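I can use the first and second rows returned to set a horizontal split, and I could probably make a first split that way, but I don't know how to select the cells above or below the identified rows. I am also not sure this is the most convenient approach.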
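One way to select the rows that lie between the blank rows is sketched below. It is only an illustration, not necessarily the most convenient approach: it assumes that a fully blank row separates consecutive tables, and the names `blank`, `block_id` and `blocks` are made up for the example.

```python
import numpy as np
import pandas as pd

# Rebuild the example sheet from the question (dtype=object lets text and numbers mix).
df = pd.DataFrame(np.nan, index=range(10), columns=range(10), dtype=object)
df.iloc[1, 2:5] = ['t1h1', 't1h2', 't1h3']
df.iloc[2:5, 2:5] = np.random.randn(3, 3)
df.iloc[6, 3:7] = ['t2h1', 't2h2', 't2h3', 't2h4']
df.iloc[7:9, 3:7] = np.random.randn(2, 4)

blank = df.isnull().all(axis=1)   # True for rows that are entirely NaN
block_id = blank.cumsum()         # increases by one at every blank row

# Drop the blank rows, then group the remaining rows by the block they fall in.
blocks = [block for _, block in df[~blank].groupby(block_id[~blank])]

for block in blocks:
    print(block.index.tolist())
```

Each element of `blocks` still spans every column of the sheet, so the empty columns would have to be dropped afterwards, for instance with `dropna(axis=1, how='all')`.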

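Disclaimer: in my case the tables always have a tag in the row above the header, because they are read by non-Python software that uses the tags to identify where each table starts. I decided to leave these tags out of the question in order to ask about a more generic problem that other people may face.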

Answer 1

import numpy as np
import pandas as pd

# I have assumed that the tables are "separated" by at least one row with only NaN values

# dtype=object lets text headers and numeric cells share a column
df=pd.DataFrame(np.nan,index=range(0,10),columns=range(0,10),dtype=object)
df.iloc[1,2:5]=['t1h1','t1h2','t1h3']
df.iloc[2:5,2:5]=np.random.randn(3,3)
df.iloc[6,3:7]=['t2h1','t2h2','t2h3','t2h4']
df.iloc[7:9,3:7]=np.random.randn(2,4)

print(df)

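# Extract by rows
# Positions of the all-NaN rows, with sentinels at -1 and len(df) so that a
# table touching the first or last row of the sheet is not lost

nul_rows = [-1] + list(np.flatnonzero(df.isnull().all(axis=1).to_numpy())) + [len(df)]

list_of_dataframes = []
for start, stop in zip(nul_rows[:-1], nul_rows[1:]):
    if stop - start > 1:  # skip runs of consecutive empty rows
        list_of_dataframes.append(df.iloc[start+1:stop,:])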

# Remove null columns

cleaned_tables = []
for _df in list_of_dataframes:
    cleaned_tables.append(_df.dropna(axis=1, how='all'))

# cleaned_tables is a list of the dataframes

print(cleaned_tables[0])
print(cleaned_tables[1])
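To finish the task from the question (one CSV file per table), each cleaned table can have its first row promoted to column names and then be written out. This is a minimal sketch under a few assumptions: the first row of every table is its header, and the helper `promote_header`, the folder `extracted_tables` and the `table_<i>.csv` file names are hypothetical. Small stand-in tables are used so that the block runs on its own.

```python
from pathlib import Path

import pandas as pd

def promote_header(table):
    """Use the first row of a table as column names and drop it from the data."""
    body = table.iloc[1:].reset_index(drop=True)
    body.columns = table.iloc[0].tolist()
    return body

# Stand-ins for cleaned_tables: a header row followed by numeric rows.
cleaned_tables = [
    pd.DataFrame([['t1h1', 't1h2'], [0.1, 0.2], [0.3, 0.4]]),
    pd.DataFrame([['t2h1', 't2h2', 't2h3'], [1.0, 2.0, 3.0]]),
]

out_dir = Path('extracted_tables')   # hypothetical output folder
out_dir.mkdir(exist_ok=True)

for i, table in enumerate(cleaned_tables):
    tidy = promote_header(table)
    tidy.to_csv(out_dir / f'table_{i}.csv', index=False)
```

Because the header text and the numbers share columns in the raw sheet, the numeric columns may still have object dtype after this step; reading the CSV back with `pd.read_csv` lets pandas infer numeric types again.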

Answer 2

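This might help to locate and extract the tables dynamically, provided that any two tables are separated by at least one row or column of NaNs.

I used the bounding box solution from https://stackoverflow.com/a/54675526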

import numpy as np
from skimage.measure import label, regionprops

# Convert the table into 0s and 1s: 0 where the cell is NaN, 1 otherwise
binary_rep = np.array(df.notnull().astype('int'))

list_of_dataframes = []
l = label(binary_rep)  # give each connected block of non-NaN cells its own label
for s in regionprops(l):
    # bbox holds the extremes of the bounding box (min_row, min_col, max_row, max_col),
    # i.e. the top-left and bottom-right cell locations of the table
    list_of_dataframes.append(df.iloc[s.bbox[0]:s.bbox[2],s.bbox[1]:s.bbox[3]])
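If scikit-image is not available, a similar result can be approximated with plain pandas by splitting on blank rows first and then on blank columns within each band of rows. This is a different, simpler technique than connected-component labelling. It assumes that blank rows span the whole sheet between vertically stacked tables and that no table contains an entirely blank row or column. The helper names `split_on_blank` and `extract_tables` and the sample `sheet` are made up for illustration.

```python
import numpy as np
import pandas as pd

def split_on_blank(frame, axis):
    """Split frame into pieces separated by all-NaN rows (axis=0) or columns (axis=1)."""
    blank = frame.isnull().all(axis=1 - axis).to_numpy()
    pieces, start = [], None
    # The trailing True acts as a sentinel that closes the last run of non-blank lines.
    for pos, is_blank in enumerate(list(blank) + [True]):
        if not is_blank and start is None:
            start = pos
        elif is_blank and start is not None:
            piece = frame.iloc[start:pos, :] if axis == 0 else frame.iloc[:, start:pos]
            pieces.append(piece)
            start = None
    return pieces

def extract_tables(frame):
    """Split on blank rows first, then split each band on its blank columns."""
    tables = []
    for band in split_on_blank(frame, axis=0):
        tables.extend(split_on_blank(band, axis=1))
    return tables

# Sample sheet: two tables side by side at the top, one table below them.
sheet = pd.DataFrame(np.nan, index=range(8), columns=range(8), dtype=object)
sheet.iloc[0, 0:2] = ['a', 'b']
sheet.iloc[1:3, 0:2] = np.array([[1, 2], [3, 4]])
sheet.iloc[0, 3:6] = ['c', 'd', 'e']
sheet.iloc[1:3, 3:6] = np.array([[5, 6, 7], [8, 9, 10]])
sheet.iloc[5, 1:3] = ['f', 'g']
sheet.iloc[6:8, 1:3] = np.array([[11, 12], [13, 14]])

tables = extract_tables(sheet)
print([t.shape for t in tables])
```

Unlike the bounding box approach, this single pass would not separate side-by-side tables of different heights whose blank rows do not line up; in that case the labelling approach above is likely the better fit.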

Answer 1: score 3, accepted, 2017-04-06 15:12:53

Answer 2: score 1, 2020-08-20 06:55:26