
Can't find string after a tag with BeautifulSoup in Python?

In this HTML I want to get the text inside the anchor, but no matter what I try it doesn't work (.string is None):

      <a href="/analyze/default/index/49398962/1/34925733" target="_blank">
       <img alt="" class="ajax-tooltip shadow radius lazy" data-id="acctInfo:34925733_1" data-original="/upload/profileIconId/default.jpg" src="/images/common/transbg.png"/>
       Jue VioIe Grace
      </a>

There are a few of these on the page, and I tried this:

print([a.string for a in soup.findAll('td', class_='tou')])

The output is just None.

EDIT: Here is the entire page HTML; I hope this helps. To clarify, I need to find every instance like the one above and extract its text.

http://pastebin.com/4mvcMsJu

You need to select the a inside each td and call .text on it. The text lives in the anchor, which is a child of the td:

print([td.a.text for td in soup.find_all('td', class_='tou')])
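If some td cells on a page have no anchor, td.a is None and td.a.text raises AttributeError. The following is a minimal sketch of a guarded version, not part of the original answer; the markup and the name "Alice" are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second cell has no <a>, so td.a is None there.
html = """
<table><tr>
<td class="tou"><a href="/x"><img src="a.png"/> Alice </a></td>
<td class="tou">no link here</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Skip cells without an anchor instead of calling .text on None.
names = [
    td.a.get_text(strip=True)
    for td in soup.find_all("td", class_="tou")
    if td.a is not None
]
print(names)
```

get_text(strip=True) strips each text node, so no separate .strip() call is needed here.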

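There obviously is a td with the class tou on the page, or you would be getting an empty list rather than a list containing None. .string is None here because the td has several children rather than a single string: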

In [10]: html = """<td class='tou'>
          <a href="/analyze/default/index/49398962/1/34925733" target="_blank">
       <img alt="" class="ajax-tooltip shadow radius lazy" data-id="acctInfo:34925733_1" data-original="/upload/profileIconId/default.jpg" src="/images/common/transbg.png"/>
       Jue VioIe Grace
      </a>
      </td>"""

In [11]: soup = BeautifulSoup(html,"html.parser")

In [12]: [a.string for a in soup.find_all('td', class_='tou')]
Out[12]: [None]

In [13]: [td.a.text for td in soup.find_all('td', class_='tou')]
Out[13]: [u'\n\n       Jue VioIe Grace\n      ']

You could also call .text on the td:

In [14]: [td.text for td in soup.find_all('td', class_='tou')]
Out[14]: [u'\n\n\n       Jue VioIe Grace\n      \n']

But that may get more than you want, because it includes the text of everything inside the td, not just the anchor.

Using your full HTML from pastebin:
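To make the point concrete, here is a small sketch with a made-up cell that holds a second element next to the anchor. The span and its contents are assumptions for illustration and do not come from the pastebin page.

```python
from bs4 import BeautifulSoup

# Hypothetical cell: an anchor followed by an extra <span>.
html = '<td class="tou"><a href="/x">Alice</a><span>rank 12</span></td>'
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", class_="tou")

anchor_text = td.a.text.strip()  # text inside the anchor only
cell_text = td.text.strip()      # text of every descendant of the td

print(anchor_text)
print(cell_text)
```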

In [18]: import requests

In [19]: soup = BeautifulSoup(requests.get("http://pastebin.com/raw/4mvcMsJu").content,"html.parser")

In [20]: [td.a.text.strip() for td in soup.find_all('td', class_='tou')]
Out[20]: 
 [u'KElTHMCBRlEF',
 u'game 5 loser',
 u'Cris',
 u'interestingstare',
 u'ApoIlo Price',
 u'Zary',
 u'Adrian Ma',
 u'Liquid Inori',
 u'focus plz',
 u'Shiphtur',
 u'Cody Sun',
 u'ApoIIo Price',
 u'Pobelter',
 u'Jue VioIe Grace',
 u'Valkrin',
 u'Piggy Kitten',
 u'1 and 17',
 u'BLOCK IT',
 u'JiaQQ1035716423',
 u'Twitchtv Flaresz']

In this case td.text.strip() gives you the same output:

In [23]: [td.text.strip() for td in soup.find_all('td', class_='tou')]
Out[23]: 
[u'KElTHMCBRlEF',
 u'game 5 loser',
 u'Cris',
 u'interestingstare',
 u'ApoIlo Price',
 u'Zary',
 u'Adrian Ma',
 u'Liquid Inori',
 u'focus plz',
 u'Shiphtur',
 u'Cody Sun',
 u'ApoIIo Price',
 u'Pobelter',
 u'Jue VioIe Grace',
 u'Valkrin',
 u'Piggy Kitten',
 u'1 and 17',
 u'BLOCK IT',
 u'JiaQQ1035716423',
 u'Twitchtv Flaresz']

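But you should understand that the two are different: td.text collects the text of everything inside the td, while td.a.text is limited to the anchor. You should also understand the difference between .string and .text: .string returns None when a tag has more than one child, whereas .text concatenates all the text beneath the tag.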
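A short sketch of the .string versus .text behaviour, using made-up markup shaped like the question's anchor (an img followed by a text node):

```python
from bs4 import BeautifulSoup

html = '<td class="tou"><a href="/x"><img src="a.png"/> Alice </a></td>'
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", class_="tou")

# The anchor has two children (<img> and a text node), so .string is None.
anchor_string = td.a.string
# .text joins every text node under the anchor, whitespace included.
anchor_text = td.a.text

# With a single text child, .string returns that text.
plain = BeautifulSoup('<a href="/x">Alice</a>', "html.parser")
plain_string = plain.a.string

print(anchor_string, repr(anchor_text), plain_string)
```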

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('input.html'), 'lxml')
>>> [tag.text.strip() for tag in soup]
[u'Jue VioIe Grace']

If we want to restrict the search to text in anchor tags:

>>> [tag.text.strip() for tag in soup.findAll('a')]
[u'Jue VioIe Grace']

Note that there are no td tags in your sample input and no tag has the attribute class='tou'.

Well, if your soup variable is built from that piece of HTML alone, then the search finds nothing, because there is no td element inside it, let alone a td element with class tou.

Now, if you want to get that text, you could call soup.findAll(text=True), which outputs something like ['\n', '\n Jue VioIe Grace\n ']
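As a quick check of what happens when the soup holds only the anchor fragment, the sketch below (with made-up markup) shows that find_all matches no td, so the list comprehension has nothing to iterate over:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: an anchor with no surrounding <td>.
soup = BeautifulSoup('<a href="/x"><img src="a.png"/> Alice </a>', "html.parser")

cells = soup.find_all("td", class_="tou")
strings = [td.string for td in cells]

print(cells)
print(strings)
```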
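If the whitespace-only entries are unwanted, one alternative is the stripped_strings generator, which strips each text node and skips the empty ones. This is a sketch on a trimmed copy of the question's fragment, not something the original answer proposed.

```python
from bs4 import BeautifulSoup

html = """<a href="/analyze/default/index/49398962/1/34925733" target="_blank">
<img alt="" src="/images/common/transbg.png"/>
 Jue VioIe Grace
</a>"""
soup = BeautifulSoup(html, "html.parser")

# Each text node is stripped; whitespace-only nodes are dropped.
texts = list(soup.stripped_strings)
print(texts)
```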
