
Scraping with BeautifulSoup where keys are in <strong> tags and values are plain text and/or tagged

I'm trying to get contact data from a big service. This is the part that I'm struggling with:

h='''<p><strong>Tel: </strong>01234 123456<strong> Tel2: </strong>01234123456<strong>   Fax: 
</strong>01234123456<br/></p>
<p><strong>Address:</strong> NAME, Address1, Address2, Address3, Postcode</p>
<p><strong>Website:</strong> <a href="https://www.test.com">https://www.test.com</a></p>
<p><strong>Email:</strong> <a href="mailto:test@example.com">test@example.com</a></p>
'''

I need to get the keys (Tel, Fax, etc.) that are between <strong> tags and their corresponding values, so I can store them in a database. The number of keys may vary (e.g. no Fax or Website). I tried the following method:

import bs4

soup = bs4.BeautifulSoup(h, "lxml")

print(soup)
for strong_tag in soup.find_all('strong'):
    key = strong_tag.text
    value = strong_tag.next_sibling
    print(key, value)

I want to get this:

Tel: 01234 123456 
Tel2: 01234123456   
Fax: 01234123456
Address: NAME, Address1, Address2, Address3, Postcode
Website: https://www.test.com
Email: test@example.com

but instead I'm getting this:

Tel:  01234 123456
Tel2:  01234123456
Fax:  01234123456
Address:  NAME, Address1, Address2, Address3, Postcode
Website:  
Email: 

I can't get the values for Email and Website. I can't just use soup.get_text() because, as mentioned, I need to upload the key/value pairs to a database. Any ideas on how to get the missing values for those two keys? Thanks.

You can get them with find_next() (findNext() is its older alias). For Website and Email, the node immediately after </strong> is the single space that precedes the <a> tag, so next_sibling returns ' ' rather than the link text. The following code falls back to the next <a> tag in that case:

import bs4

soup = bs4.BeautifulSoup(h, "lxml")

for strong_tag in soup.find_all('strong'):
    key = strong_tag.text
    value = strong_tag.next_sibling
    # A lone space means the real value sits inside the following <a> tag
    if value == ' ':
        value = strong_tag.find_next('a').text
    print(key, value)

It prints:

Tel:  01234 123456
 Tel2:  01234123456
   Fax: 
 01234123456
Address:  NAME, Address1, Address2, Address3, Postcode
Website: https://www.test.com
Email: test@example.com
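
The output above still carries stray whitespace and newlines in the keys, and the `' '` check only works when the value is a link preceded by exactly one space. If you want clean pairs for a database, one possible approach is to collect everything between one <strong> and the next and strip it. This is only a sketch under some assumptions: the function name `extract_pairs` is made up for illustration, I used the built-in "html.parser" instead of "lxml" so nothing extra needs installing, and I assumed you want the trailing colon removed from the keys.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Sample markup copied from the question
h = '''<p><strong>Tel: </strong>01234 123456<strong> Tel2: </strong>01234123456<strong>   Fax: 
</strong>01234123456<br/></p>
<p><strong>Address:</strong> NAME, Address1, Address2, Address3, Postcode</p>
<p><strong>Website:</strong> <a href="https://www.test.com">https://www.test.com</a></p>
<p><strong>Email:</strong> <a href="mailto:test@example.com">test@example.com</a></p>
'''


def extract_pairs(html):
    """Hypothetical helper: map each <strong> label to the text that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = {}
    for strong in soup.find_all("strong"):
        # Normalise the key: drop surrounding whitespace and the trailing colon
        key = strong.get_text(strip=True).rstrip(":")
        parts = []
        for sibling in strong.next_siblings:
            # Stop at the next label inside the same parent
            if isinstance(sibling, Tag) and sibling.name == "strong":
                break
            # Tags (<a>, <br/>) contribute their text; plain strings are used as-is
            parts.append(sibling.get_text() if isinstance(sibling, Tag) else str(sibling))
        pairs[key] = "".join(parts).strip()
    return pairs


if __name__ == "__main__":
    for key, value in extract_pairs(h).items():
        print(f"{key}: {value}")
```

Because the result is a plain dict, a missing Fax or Website simply means that key is absent, which should make it easier to map onto database columns.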
