Python - Extracting data from this Html tag using BS4, instead of getting None

This is my code:

from bs4 import BeautifulSoup

html = '''
<td class="ClassName class" width="60%">Data I want to extract<span lang=EN- 
UK style="font-size:12pt;font-family:'arial'"></span></td>
'''


soup = BeautifulSoup(html, 'html.parser')

print(soup.select_one('td').string)

It returns None. I think it has to do with the empty span tag: maybe it descends into that span and returns its contents? So I want to either delete that span tag, stop as soon as it finds 'Data I want to extract', or tell it to ignore empty tags.

If there are no empty tags inside 'td' it actually works.

Is there a way to ignore empty tags in general and go one step back, rather than ignoring only this specific span tag?

Sorry if this is too elementary, but I spent a fair amount of time searching.

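Use the .text property, not .string. .string is None whenever a tag has more than one child, and here the td holds both a text node and the span: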

html = '''
<td class="ClassName class" width="60%">Data I want to extract<span lang=EN-
UK style="font-size:12pt;font-family:'arial'"></span></td>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

print(soup.select_one('td').text)

Output:

Data I want to extract
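
To see why .string gives None here, it may help to look at the children of the td directly. The following is a minimal sketch using the same markup as the question, with the lang attribute quoted and joined onto one line for readability; the variable names td and children are just illustrative.

```python
from bs4 import BeautifulSoup

html = '''<td class="ClassName class" width="60%">Data I want to extract<span lang="EN-UK" style="font-size:12pt;font-family:'arial'"></span></td>'''

soup = BeautifulSoup(html, 'html.parser')
td = soup.select_one('td')

# The td has two children: a text node and the (empty) span tag.
children = list(td.children)
print(len(children))

# .string is only defined when a tag has exactly one child,
# so with two children it is None, even though the span is empty.
print(td.string)

# .text / get_text() concatenate every text node under the tag,
# so the empty span simply contributes nothing.
print(td.get_text(strip=True))
```

Passing strip=True to get_text() also trims surrounding whitespace, which is usually what you want for table cells.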

Use .text:

>>> soup.find('td').text
u'Data I want to extract'
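
On the more general question of ignoring empty tags without naming the span: one possible approach, sketched here under the assumption that you want the first piece of non-blank text, is to iterate over the strings instead of removing tags. The names first_text and direct_text are illustrative, and the markup is a shortened version of the one in the question.

```python
from bs4 import BeautifulSoup

html = '<td class="ClassName class" width="60%">Data I want to extract<span style="font-size:12pt"></span></td>'

soup = BeautifulSoup(html, 'html.parser')
td = soup.select_one('td')

# stripped_strings yields each non-blank text node under the tag,
# with whitespace trimmed; empty tags yield nothing and are skipped.
first_text = next(td.stripped_strings, None)
print(first_text)

# Alternatively, look only at text that is a direct child of the td,
# ignoring anything nested inside child tags.
direct_text = td.find(string=True, recursive=False)
print(direct_text)
```

Both variants leave the parsed tree untouched. If nested tags may contain text you also need, .text or get_text() from the answers above is probably the simpler choice.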
