
Extract text of element line by line

I am using BeautifulSoup to extract various elements from a website, and I have run into a situation I cannot find an answer for. I want to extract the text of a link, but the link text is split across three lines by <br> tags. For example:

<span class="location-address">
<a href="https://www.google.com/maps" target="_blank">
"123 Main St"
<br>
"Suite 456" 
<br> 
"Everywhere, USA 12345"
</a>

When I use find_all("span",{"class":"location-address"})[0].text I get something like "123 Main StSuite 456Everywhere, USA 12345", with the lines run together. I would prefer a more natural result that keeps the three lines separate.

You can try find_all("span",{"class":"location-address"})[0].contents instead of find_all("span",{"class":"location-address"})[0].text. It returns a list of the span's child nodes, including the link tag with its <br> elements intact, rather than flattened text. You can then replace each <br> with \n or process the nodes however you need.
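A minimal sketch of the "replace <br> with \n" idea, under a few assumptions not in the original post: the sample markup is given a closing </span>, the built-in html.parser is used instead of lxml, and the names link and address are purely illustrative. It swaps each <br> for a newline string and then tidies the resulting lines.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, with a closing </span> added (assumption).
html = """<span class="location-address">
<a href="https://www.google.com/maps" target="_blank">
"123 Main St"
<br>
"Suite 456"
<br>
"Everywhere, USA 12345"
</a></span>"""

soup = BeautifulSoup(html, "html.parser")
link = soup.find("span", {"class": "location-address"}).find("a")

# Swap every <br> tag inside the link for a plain newline string.
for br in link.find_all("br"):
    br.replace_with("\n")

# Trim whitespace and the literal quote characters, dropping empty lines.
lines = [line.strip().strip('"') for line in link.get_text().splitlines()]
address = "\n".join(line for line in lines if line)
print(address)
```

This keeps one address component per line; joining with ", " instead of "\n" would give a single-line address.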

If you only have one span tag with class=location-address then simply use the find() method.

>>> from bs4 import BeautifulSoup
>>> html = """<span class="location-address">
... <a href="https://www.google.com/maps" target="_blank">
... "123 Main St"
... <br>
... "Suite 456" 
... <br> 
... "Everywhere, USA 12345"
... </a>"""
>>> soup = BeautifulSoup(html, 'lxml')
>>> soup.find('span', class_='location-address').find_next('a').get_text(strip=True).replace('"', '')
'123 Main StSuite 456Everywhere, USA 12345'

But if you have more than one "span" tag with the given class, using the find_all() method you can do something like this:

>>> for span in soup.find_all('span', class_='location-address'):
...     span.find('a').get_text(strip=True).replace('"', '')
... 
'123 Main StSuite 456Everywhere, USA 12345'

Or use a CSS selector:

>>> for a in soup.select('span.location-address > a'):
...     a.get_text(strip=True).replace('"', '')
... 
'123 Main StSuite 456Everywhere, USA 12345'
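The examples above still run the three lines together, because get_text(strip=True) joins the pieces with an empty string. As a hedged sketch of one way to keep them apart, the snippet below uses the separator argument of get_text() and the stripped_strings generator. The closing </span>, the html.parser choice, and the names one_line and multi_line are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, with a closing </span> added (assumption).
html = """<span class="location-address">
<a href="https://www.google.com/maps" target="_blank">
"123 Main St"
<br>
"Suite 456"
<br>
"Everywhere, USA 12345"
</a></span>"""

soup = BeautifulSoup(html, "html.parser")

one_line = []    # each address joined with ", "
multi_line = []  # each address joined with newlines

for a in soup.select("span.location-address > a"):
    # stripped_strings yields each text piece with surrounding whitespace removed.
    parts = [s.strip('"') for s in a.stripped_strings]
    one_line.append(", ".join(parts))

    # separator is placed between text pieces; strip=True trims each piece first.
    multi_line.append(a.get_text(separator="\n", strip=True).replace('"', ""))

print(one_line[0])
print(multi_line[0])
```

Stripping quotes per piece with strip('"') only touches the ends of each string, whereas replace('"', "") would also remove quote characters inside an address.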
