Extracting a table from HTML by position using Python

Question

I want to extract a specific table from an HTML document that contains multiple tables, but unfortunately the tables have no identifiers. Each table does have a title, however. I just can't seem to figure it out.

Here is an example HTML file:

<BODY>
<TABLE>
<TH>
<H3>    <BR>TABLE 1    </H3>
</TH>
<TR>
<TD>Data 1    </TD>
<TD>Data 2    </TD>
</TR>
<TR>
<TD>Data 3    </TD>
<TD>Data 4    </TD>
</TR>
<TR>
<TD>Data 5    </TD>
<TD>Data 6    </TD>
</TR>
</TABLE>

<TABLE>
<TH>
<H3>    <BR>TABLE 2    </H3>
</TH>
<TR>
<TD>Data 7    </TD>
<TD>Data 8    </TD>
</TR>
<TR>
<TD>Data 9    </TD>
<TD>Data 10    </TD>
</TR>
<TR>
<TD>Data 11    </TD>
<TD>Data 12    </TD>
</TR>
</TABLE>
</BODY>

I can use BeautifulSoup 4 to get tables by id or name, but I need a single table that is only identifiable by position.

I know that I can get the first table with:

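from bs4 import BeautifulSoup

tmp = f.read()  # f is an already-open file handle for the HTML document
soup = BeautifulSoup(tmp, 'html.parser')  # parse the document
table = soup.find('table')  # returns the first table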

but how would I get the second table?

Answer 1

You can rely on the table title.

Find the element by its text, passing a function as the value of the text argument, then get the enclosing table with find_parent:

table_name = "TABLE 1"

# find the text node containing the title, then climb to its enclosing table
table = soup.find(text=lambda x: x and table_name in x).find_parent('table')
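For reference, here is a minimal self-contained sketch of the title-based lookup, run against a trimmed copy of the sample document from the question. The helper name find_table_by_title is hypothetical. It compares the stripped text for equality instead of using "in", on the assumption that a title such as "TABLE 1" should not also match "TABLE 10". It also uses string=, which newer Beautiful Soup releases (4.4 and later) accept as the preferred name for the text= argument.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document from the question.
HTML = """
<BODY>
<TABLE>
<TH><H3>    <BR>TABLE 1    </H3></TH>
<TR><TD>Data 1    </TD><TD>Data 2    </TD></TR>
</TABLE>
<TABLE>
<TH><H3>    <BR>TABLE 2    </H3></TH>
<TR><TD>Data 7    </TD><TD>Data 8    </TD></TR>
</TABLE>
</BODY>
"""


def find_table_by_title(soup, title):
    """Return the <table> enclosing a text node equal to `title`, or None.

    Hypothetical helper: surrounding whitespace is stripped before comparing.
    """
    node = soup.find(string=lambda s: s is not None and s.strip() == title)
    return node.find_parent("table") if node is not None else None


soup = BeautifulSoup(HTML, "html.parser")
table = find_table_by_title(soup, "TABLE 2")

# Collect the cell text of the matched table, without padding whitespace.
cells = [td.get_text(strip=True) for td in table.find_all("td")]
print(cells)
```

Returning None when no title matches avoids the AttributeError that the chained one-liner raises when find comes back empty.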

Answer 2

If it's only identifiable by position, meaning it's always the second table in the document, you could do:

tmp = f.read()
soup = BeautifulSoup(tmp, 'html.parser')

# find_all returns every table in document order; index 1 is the second one
all_tables = soup.find_all('table')
second_table = all_tables[1]
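A slightly more defensive sketch of the positional lookup follows. The helper names table_at and table_rows are hypothetical, and the sample markup is a shortened stand-in for the document in the question. Note that find_all('table') also returns nested tables, so the index counts every table in document order.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the document in the question.
HTML = """
<BODY>
<TABLE>
<TR><TD>Data 1</TD><TD>Data 2</TD></TR>
</TABLE>
<TABLE>
<TR><TD>Data 7</TD><TD>Data 8</TD></TR>
<TR><TD>Data 9</TD><TD>Data 10</TD></TR>
</TABLE>
</BODY>
"""


def table_at(soup, index):
    """Return the table at zero-based `index` in document order, or None."""
    tables = soup.find_all("table")
    return tables[index] if 0 <= index < len(tables) else None


def table_rows(table):
    """Convert a <table> into a list of rows, each a list of cell strings."""
    return [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
    ]


soup = BeautifulSoup(HTML, "html.parser")
second = table_at(soup, 1)  # index 1 is the second table
print(table_rows(second))
```

The bounds check returns None instead of raising IndexError when the document has fewer tables than expected, which is worth handling if the page layout can change.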

Answer 1: score 2, accepted, posted 2015-03-10 20:43:17
Answer 2: score 0, posted 2015-03-10 20:54:57