How can I extract the table headings for both table types from the HTML below using Beautiful Soup?
<body>
<p>some other data 1</p>
<p>Table1 heading</p>
<div></div>
<div>
<div><table width="15%"><tbody>
<tr>
<td><p>data1_00</p></td>
<td><p>data1_01</p></td>
</tr>
<tr>
<td><p>data1_10</p></td>
<td><p>data1_11</p></td>
</tr>
</tbody></table></div>
</div>
<br><br>
<div>some other data 2</div>
<div>Table2 heading</div>
<div>
<div><table width="15%"><tbody>
<tr>
<td><p>data2_00</p></td>
<td><p>data2_01</p></td>
</tr>
<tr>
<td><p>data2_10</p></td>
<td><p>data2_11</p></td>
</tr>
</tbody></table></div>
</div>
</body>
In the first table, the heading is inside a <p> tag; in the second table, it is inside a <div> tag. There is also an empty <div> tag between the first heading and its table.
How can I extract both table headings?
Currently, I search for the previous <div> above the current table using table.find_previous('div') and save the text inside it as the heading.
from bs4 import BeautifulSoup
import urllib.request

# url is the address of the page being scraped (not shown in the question)
htmlpage = urllib.request.urlopen(url)
page = BeautifulSoup(htmlpage, "html.parser")
all_divtables = page.find_all('table')
for table in all_divtables:
    curr_div = table
    while True:
        # step back to the nearest preceding <div>
        curr_div = curr_div.find_previous('div')
        if len(curr_div.find_all('table')) > 0:
            # this <div> wraps a table, so keep looking
            continue
        else:
            heading = curr_div.text.strip()
            print(heading)
            break
Desired output:
Table1 heading
Table2 heading
You can pass a function (here a lambda) to find_previous(). It selects the nearest preceding tag that does not contain a table and whose text is not empty:
data = '''<body>
<p>some other data 1</p>
<p>Table1 heading</p>
<div></div>
<div>
<div><table width="15%"><tbody>
<tr>
<td><p>data1_00</p></td>
<td><p>data1_01</p></td>
</tr>
<tr>
<td><p>data1_10</p></td>
<td><p>data1_11</p></td>
</tr>
</tbody></table></div>
</div>
<br><br>
<div>some other data 2</div>
<div>Table2 heading</div>
<div>
<div><table width="15%"><tbody>
<tr>
<td><p>data2_00</p></td>
<td><p>data2_01</p></td>
</tr>
<tr>
<td><p>data2_10</p></td>
<td><p>data2_11</p></td>
</tr>
</tbody></table></div>
</div>
<div>some other data 3</div>
<div>Table3 heading</div>
<div>
<div><table width="15%"><tbody>
<tr>
<td><p>data2_00z</p></td>
<td><p>data2_01z</p></td>
</tr>
<tr>
<td><p>data2_10z</p></td>
<td><p>data2_11z</p></td>
</tr>
</tbody></table></div>
</div>
<div>
<div><table width="15%"><tbody>
<tr>
<td><p>data2_00x</p></td>
<td><p>data2_01x</p></td>
</tr>
<tr>
<td><p>data2_10x</p></td>
<td><p>data2_11x</p></td>
</tr>
</tbody></table></div>
</div>
</body>'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')
for table in soup.select('table'):
    for i in table.find_previous(lambda t: not t.find('table') and t.text.strip() != ''):
        if i.find_parents('table'):
            continue
        print(i)
        print('*' * 80)
Prints:
Table1 heading
********************************************************************************
Table2 heading
********************************************************************************
Table3 heading
********************************************************************************
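The same idea can be wrapped in a function that returns the headings instead of printing them. This is only a sketch under some assumptions: the names is_heading_candidate and table_headings are made up for illustration, the sample HTML is a shortened version of the markup above, and the built-in 'html.parser' is used instead of 'lxml' so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Shortened sample input (assumption: same shape as the HTML in the question).
SAMPLE_HTML = """<body>
<p>some other data 1</p>
<p>Table1 heading</p>
<div></div>
<div>
<div><table><tbody>
<tr><td><p>data1_00</p></td><td><p>data1_01</p></td></tr>
</tbody></table></div>
</div>
<br><br>
<div>some other data 2</div>
<div>Table2 heading</div>
<div>
<div><table><tbody>
<tr><td><p>data2_00</p></td><td><p>data2_01</p></td></tr>
</tbody></table></div>
</div>
<div>
<div><table><tbody>
<tr><td><p>data3_00</p></td></tr>
</tbody></table></div>
</div>
</body>"""


def is_heading_candidate(tag):
    """True for a tag that has visible text and does not wrap a table."""
    return tag.find('table') is None and tag.get_text(strip=True) != ''


def table_headings(html):
    """Return one entry per table: the heading text, or None if none is found."""
    soup = BeautifulSoup(html, 'html.parser')
    headings = []
    for table in soup.find_all('table'):
        candidate = table.find_previous(is_heading_candidate)
        # A candidate that sits inside another table is cell data, not a heading.
        if candidate is None or candidate.find_parent('table') is not None:
            headings.append(None)
        else:
            headings.append(candidate.get_text(strip=True))
    return headings


print(table_headings(SAMPLE_HTML))
# ['Table1 heading', 'Table2 heading', None]
```

The third table in the sample has no heading of its own, so its entry is None. That matches the loop above, which prints nothing for a table whose nearest preceding text belongs to another table.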
urldata='''<body>
<p>some other data 1</p>
<p>Table1 heading</p>
<div></div>
<div>
<div><table width="15%"><tbody>
<tr>
<td><p>data1_00</p></td>
<td><p>data1_01</p></td>
</tr>
<tr>
<td><p>data1_10</p></td>
<td><p>data1_11</p></td>
</tr>
</tbody></table></div>
</div>
<br><br>
<div>some other data 2</div>
<div>Table2 heading</div>
<div>
<div><table width="15%"><tbody>
<tr>
<td><p>data2_00</p></td>
<td><p>data2_01</p></td>
</tr>
<tr>
<td><p>data2_10</p></td>
<td><p>data2_11</p></td>
</tr>
</tbody></table></div>
</div>
</body>'''
import re
from bs4 import BeautifulSoup

# parse the HTML stored in urldata above
soup = BeautifulSoup(urldata, 'lxml')
# find every text node that contains the word "heading"
results = soup.body.find_all(string=re.compile('heading'))
for result in results:
    print(result)
Output:
Table1 heading
Table2 heading
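Note that this search only works when the heading text actually contains the word "heading", and it would also match that word inside a table cell. A possible refinement, shown here as a sketch, drops matches that are located inside a table. The sample HTML and the variable names are assumptions made for illustration, and 'html.parser' is used so that lxml is not needed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: one cell also contains the word "heading".
html = """<body>
<p>Table1 heading</p>
<div><table><tr><td><p>heading inside a cell</p></td></tr></table></div>
<div>Table2 heading</div>
<div><table><tr><td><p>data2_00</p></td></tr></table></div>
</body>"""

soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile(r'heading', re.IGNORECASE)

headings = [
    text.strip()
    for text in soup.body.find_all(string=pattern)
    if text.find_parent('table') is None  # ignore matches inside table cells
]
print(headings)
# ['Table1 heading', 'Table2 heading']
```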