I'm trying to get contact data from a big service. This is the part that I'm struggling with:
h='''<p><strong>Tel: </strong>01234 123456<strong> Tel2: </strong>01234123456<strong> Fax:
</strong>01234123456<br/></p>
<p><strong>Address:</strong> NAME, Address1, Address2, Address3, Postcode</p>
<p><strong>Website:</strong> <a href="https://www.test.com">https://www.test.com</a></p>
<p><strong>Email:</strong> <a href="mailto:test@example.com">test@example.com</a></p>
'''
I need to get the keys (Tel, Fax, etc.) that are between the <strong> tags, along with their corresponding values, so I can store them in a database. The number of keys may vary (for example, there may be no Fax or Website). I tried the following method:
import requests, bs4

soup = bs4.BeautifulSoup(h, "lxml")
print(soup)
for strong_tag in soup.find_all('strong'):
    key = strong_tag.text
    value = strong_tag.next_sibling
    print(key, value)
I want to get this:
Tel: 01234 123456
Tel2: 01234123456
Fax: 01234123456
Address: NAME, Address1, Address2, Address3, Postcode
Website: https://www.test.com
Email: test@example.com
but instead I'm getting this:
Tel: 01234 123456
Tel2: 01234123456
Fax: 01234123456
Address: NAME, Address1, Address2, Address3, Postcode
Website:
Email:
I can't get the values for Email and Website. I can't just use soup.get_text() because, as mentioned, I need to upload the data to a database. Any ideas on how to get the missing values for those two keys? Thanks.
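You can get them with find_next() (findNext() is the older spelling of the same method). next_sibling returns the node immediately after the <strong> tag. For Website and Email that node is just the space before the <a> tag, so the value comes back as ' ' and the link text is never reached. Checking for that case and falling back to the next <a> tag works: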
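import bs4

# h is the HTML string from the question
soup = bs4.BeautifulSoup(h, "lxml")
for strong_tag in soup.find_all('strong'):
    key = strong_tag.text
    value = strong_tag.next_sibling
    # For Website and Email the next sibling is only the space before the <a> tag
    if value == ' ':
        # find_next is the current name for findNext
        value = strong_tag.find_next('a').text
    print(key, value)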
It returns:
Tel: 01234 123456
Tel2: 01234123456
Fax:
01234123456
Address: NAME, Address1, Address2, Address3, Postcode
Website: https://www.test.com
Email: test@example.com
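Since the values are going into a database, it may be easier to collect them into a dict and strip the stray whitespace (the line break after "Fax:" comes from the key text itself). Below is a minimal sketch of that idea, not the only way to do it. It walks every sibling after each <strong> tag until the next <strong>, so it picks up both plain text and link text without checking for ' '. The function name extract_contacts is made up for illustration, the trailing colon is dropped from each key as a choice of mine, and "html.parser" is used only so the example does not depend on lxml being installed.

```python
import bs4

h = '''<p><strong>Tel: </strong>01234 123456<strong> Tel2: </strong>01234123456<strong> Fax:
</strong>01234123456<br/></p>
<p><strong>Address:</strong> NAME, Address1, Address2, Address3, Postcode</p>
<p><strong>Website:</strong> <a href="https://www.test.com">https://www.test.com</a></p>
<p><strong>Email:</strong> <a href="mailto:test@example.com">test@example.com</a></p>
'''


def extract_contacts(html):
    """Map each <strong> label to the text that follows it, up to the next <strong>.

    Hypothetical helper written for this example.
    """
    soup = bs4.BeautifulSoup(html, "html.parser")
    contacts = {}
    for strong_tag in soup.find_all("strong"):
        # " Fax:\n" becomes "Fax"
        key = strong_tag.get_text(strip=True).rstrip(":")
        parts = []
        for sibling in strong_tag.next_siblings:
            if isinstance(sibling, bs4.Tag):
                if sibling.name == "strong":
                    break  # the next label starts here
                parts.append(sibling.get_text())  # e.g. the text inside <a>
            else:
                parts.append(str(sibling))  # plain text node
        contacts[key] = "".join(parts).strip()
    return contacts


contacts = extract_contacts(h)
for key, value in contacts.items():
    print(f"{key}: {value}")
```

Note that if the same label appears twice in one page, the later value overwrites the earlier one in this sketch; a list of (key, value) pairs would avoid that.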