How can I get text out of a <dt> tag with a <span> inside?

I'm trying to extract the text from inside a <dt> tag with a <span> inside on www.uszip.com:

Here is an example of what I'm trying to get:

<dt>Land area<br><span class="stype">(sq. miles)</span></dt>
<dd>14.28</dd>

I want to get the 14.28 out of the <dd> tag that follows. This is how I'm currently approaching it:

Note: soup is the BeautifulSoup version of the entire webpage's source code:

soup.find("dt",text="Land area").contents[0]

However, this is giving me a

AttributeError: 'NoneType' object has no attribute 'contents'

I've tried a lot of things and I'm not sure how to approach this. This method works for some of the other data on this page, like:

<dt>Total population</dt>
<dd>22,234<span class="trend trend-down" title="-15,025 (-69.77% since 2000)">&#9660;</span></dd>

Using soup.find("dt",text="Total population").next_sibling.contents[0] on this returns '22,234' .

How should I try to first identify the correct tag and then get the right data out of it?

Unfortunately, you cannot match tags with both text and nested tags, based on the contained text alone.

You'd have to loop over all <dt> without text:

for dt in soup.find_all('dt', text=False):
    if 'Land area' in dt.text:
        # dt.contents[0] is only the label; the value is in the <dd> that follows
        print(dt.find_next_sibling('dd').contents[0])

This sounds counter-intuitive, but the text argument is matched against a tag's .string attribute, and .string is None for any tag with more than one child, such as this <dt>. The .text attribute joins the strings of all nested tags, but it is not what the text argument is matched against.
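
To make the difference concrete, here is a minimal sketch, assuming the bs4 package is installed and using Python's built-in html.parser. The HTML is a cut-down version of the two snippets from the question, wrapped in a <dl> that I added for illustration.

```python
from bs4 import BeautifulSoup

# Cut-down markup from the question; the <dl> wrapper is an assumption.
html = (
    '<dl>'
    '<dt>Land area<br><span class="stype">(sq. miles)</span></dt><dd>14.28</dd>'
    '<dt>Total population</dt><dd>22,234</dd>'
    '</dl>'
)
soup = BeautifulSoup(html, "html.parser")

land, population = soup.find_all("dt")

# This <dt> has three children (a string, <br>, <span>), so it has no
# single string and .string is None.
land_string = land.string

# .text joins every nested string into one.
land_text = land.text

# This <dt> has exactly one child, a string, so .string returns it.
population_string = population.string

print(repr(land_string), repr(land_text), repr(population_string))
```

This is why the text="Total population" search succeeds while text="Land area" finds nothing and returns None.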

You could also use a custom function to do the search:

soup.find_all(lambda t: t.name == 'dt' and 'Land area' in t.text)

which essentially does the same search with the filter encapsulated in a lambda function.
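
Putting it together, the sketch below goes from a label to the value in the following <dd>. It assumes bs4 is installed; value_for is a hypothetical helper name, not part of BeautifulSoup, and the <dl> wrapper around the question's markup is also my addition.

```python
from bs4 import BeautifulSoup

# Markup from the question; the <dl> wrapper is an assumption.
html = (
    '<dl>'
    '<dt>Land area<br><span class="stype">(sq. miles)</span></dt>'
    '<dd>14.28</dd>'
    '<dt>Total population</dt>'
    '<dd>22,234<span class="trend trend-down">&#9660;</span></dd>'
    '</dl>'
)
soup = BeautifulSoup(html, "html.parser")


def value_for(soup, label):
    """Hypothetical helper: return the first string of the <dd> that
    follows the first <dt> whose combined text contains `label`."""
    dt = soup.find(lambda t: t.name == "dt" and label in t.text)
    if dt is None:
        return None
    dd = dt.find_next_sibling("dd")
    if dd is None:
        return None
    # Take only the first non-empty string, so nested markers such as
    # the trend arrow are left out.
    return next(dd.stripped_strings, None)


print(value_for(soup, "Land area"))
print(value_for(soup, "Total population"))
```

Using find_next_sibling("dd") rather than .next_sibling skips any whitespace text node that may sit between the <dt> and the <dd> in the real page. Note that the label test is a substring match, so the first matching <dt> wins.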
