
Python BeautifulSoup: loop over a tag (<td><b>) and get all its siblings (a href)

I have the following HTML file that I want to traverse with Python's BeautifulSoup:

<table align=center border='1' cellpadding="8"><tr><td><b>1940 (Spanish)  Jan</b> 
<a href="./1940sp/jan/2/home.htm" target="_parent">2</a>&nbsp 
<a href="./1940sp/jan/4/home.htm" target="_parent">4</a>&nbsp    
<td><b>1940 (English)  Jan</b> 
<a href="./1940/jan/2/home.htm" target="_parent">2</a>&nbsp 
<a href="./1940/jan/4/home.htm" target="_parent">4</a>&nbsp     
<tr><td><b>1940 (Spanish)  Feb</b> 
<a href="./1940sp/feb/1/home.htm" target="_parent">1</a>&nbsp 
 ...OMITTED...
<td><b>1940 (English)  Indices</b> 
<a href="./1940/ndx1/home.htm" target="_parent">Jan to Mar</a>&nbsp 
</table>

Some of the td tags in this HTML are closed and some are not, but I guess this does not matter. What I'm trying to get is the text of each link together with the corresponding bold text, like so:

1940 (Spanish)  Jan|2
1940 (Spanish)  Jan|4
1940 (English)  Jan|2
1940 (English)  Jan|4
   ...
1940 (English)  Indices|Jan to Mar

I can already iterate over the bold labels with my code; what I am trying to figure out is how to iterate over the text of the a tags. The Python code I have right now is below:

import requests
from bs4 import BeautifulSoup

url = "http://nlpdl.nlp.gov.ph/OG01/1902"

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Every <b> label that follows the first <td>
elements = soup.find("td").find_all_next("b")
for el in elements:
    print(el)

Thanks in advance!

This should help you:

from bs4 import BeautifulSoup

html = """
<table align=center border='1' cellpadding="8"><tr><td><b>1940 (Spanish)  Jan</b> 
<a href="./1940sp/jan/2/home.htm" target="_parent">2</a>&nbsp 
<a href="./1940sp/jan/4/home.htm" target="_parent">4</a>&nbsp    
<td><b>1940 (English)  Jan</b> 
<a href="./1940/jan/2/home.htm" target="_parent">2</a>&nbsp 
<a href="./1940/jan/4/home.htm" target="_parent">4</a>&nbsp     
<tr><td><b>1940 (Spanish)  Feb</b> 
<a href="./1940sp/feb/1/home.htm" target="_parent">1</a>&nbsp 
<td><b>1940 (English)  Indices</b> 
<a href="./1940/ndx1/home.htm" target="_parent">Jan to Mar</a>&nbsp 
</table>
"""

soup = BeautifulSoup(html,'html5lib')

table = soup.find('table')

a_tags = table.find_all('a')

for a in a_tags:
    print(a.text)

Output:

2
4
2
4
1
Jan to Mar

This is the full version (with the HTML fetched using requests and the output formatted as label|text):

from bs4 import BeautifulSoup
import requests

url = "http://nlpdl.nlp.gov.ph/OG01/1902"

page = requests.get(url).text

soup = BeautifulSoup(page,'html5lib')

table = soup.find('table')

a_tags = table.find_all('a')
elements = soup.find("td").find_all_next("b")

for x in range(len(elements)):
    print(f"{elements[x].text}|{a_tags[x].text}")

Output:

1902 (Spanish)  Sep|10
1902 (Spanish)  Oct|17
1902 (Spanish)  Nov|24
1902 (Spanish)  Dec|1
1902 (Spanish)  Indices|8
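
Note that pairing by index assumes exactly one link per bold label, while the sample in the question has several links per label. A minimal sketch of an alternative is to walk each <b> and collect the <a> tags that follow it as siblings. The helper name pairs_by_sibling and the shortened sample are assumptions made for illustration, not part of the original answer:

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (unclosed <td> tags kept on purpose).
sample = """<table><tr><td><b>1940 (Spanish)  Jan</b>
<a href="./1940sp/jan/2/home.htm">2</a>&nbsp
<a href="./1940sp/jan/4/home.htm">4</a>&nbsp
<td><b>1940 (English)  Jan</b>
<a href="./1940/jan/2/home.htm">2</a>&nbsp
</table>"""


def pairs_by_sibling(html):
    """Return 'label|link text' strings, one per <a> that follows a <b> as a sibling."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for b in soup.find_all("b"):
        # find_next_siblings only looks at later tags that share b's parent
        for a in b.find_next_siblings("a"):
            rows.append(f"{b.get_text()}|{a.get_text()}")
    return rows


for row in pairs_by_sibling(sample):
    print(row)
```

Because each link is looked up relative to its own label, the number of links per label no longer matters.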

You can use .find_previous('b') to find the matching <b> tags:

from bs4 import BeautifulSoup


txt = '''<table align=center border='1' cellpadding="8"><tr><td><b>1940 (Spanish)  Jan</b>
<a href="./1940sp/jan/2/home.htm" target="_parent">2</a>&nbsp
<a href="./1940sp/jan/4/home.htm" target="_parent">4</a>&nbsp
<td><b>1940 (English)  Jan</b>
<a href="./1940/jan/2/home.htm" target="_parent">2</a>&nbsp
<a href="./1940/jan/4/home.htm" target="_parent">4</a>&nbsp
<tr><td><b>1940 (Spanish)  Feb</b>
<a href="./1940sp/feb/1/home.htm" target="_parent">1</a>&nbsp
 ...OMITTED...
<td><b>1940 (English)  Indices</b>
<a href="./1940/ndx1/home.htm" target="_parent">Jan to Mar</a>&nbsp
</table>'''

soup = BeautifulSoup(txt, 'html.parser')

for a in soup.select('a'):
    print(a.find_previous('b').text, a.text)

Prints:

1940 (Spanish)  Jan 2
1940 (Spanish)  Jan 4
1940 (English)  Jan 2
1940 (English)  Jan 4
1940 (Spanish)  Feb 1
1940 (English)  Indices Jan to Mar
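
To get the exact label|text lines from the question, the same idea can be wrapped in a small function. This is only a sketch: the name label_link_lines, the sep parameter, and the shortened sample are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup.
sample = """<table><tr><td><b>1940 (Spanish)  Jan</b>
<a href="./1940sp/jan/2/home.htm">2</a>&nbsp
<a href="./1940sp/jan/4/home.htm">4</a>&nbsp
<td><b>1940 (English)  Indices</b>
<a href="./1940/ndx1/home.htm">Jan to Mar</a>&nbsp
</table>"""


def label_link_lines(html, sep="|"):
    """Pair every <a> with the nearest preceding <b> and join them with sep."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for a in soup.find_all("a"):
        b = a.find_previous("b")
        if b is None:
            # Link appears before any bold label; nothing to pair it with.
            continue
        lines.append(f"{b.get_text()}{sep}{a.get_text()}")
    return lines


print("\n".join(label_link_lines(sample)))
```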

Try this:

import requests
from bs4 import BeautifulSoup

url = "http://nlpdl.nlp.gov.ph/OG01/1902"

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

elements = soup.find("td").find_all_next("b")
links = soup.find("table").find_all("a")

# zip pairs the n-th label with the n-th link and stops at the shorter list
for el, li in zip(elements, links):
    print('{a}|{b}'.format(a=el.text, b=li.text))
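
One caveat worth checking: zip pairs items by position and stops at the shorter list, so when a label has more than one link the pairs can drift. A quick check on a shortened copy of the question's markup makes this visible (the sample and the variable names are assumptions made for illustration):

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup: two labels, three links.
sample = """<table><tr><td><b>1940 (Spanish)  Jan</b>
<a href="./1940sp/jan/2/home.htm">2</a>&nbsp
<a href="./1940sp/jan/4/home.htm">4</a>&nbsp
<td><b>1940 (English)  Jan</b>
<a href="./1940/jan/2/home.htm">2</a>&nbsp
</table>"""

soup = BeautifulSoup(sample, "html.parser")
labels = [b.get_text() for b in soup.find_all("b")]
links = [a.get_text() for a in soup.find_all("a")]

# zip yields only as many pairs as the shorter list, matched by position.
zipped = [f"{label}|{link}" for label, link in zip(labels, links)]

print(len(labels), len(links))
for line in zipped:
    print(line)
```

Here the second pair attaches link 4 to the English label although it sits under the Spanish one, which is why a per-link lookup such as find_previous('b') tends to be safer for this markup.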
