How can I extract data from an HTML tag using Python?

I want to extract the translation of a word from an online dictionary. For example, the HTML for 'car':

<ol class="sense_list level_1">
     <li class="sense_list_item level_1" value="1"><span class="def">any vehicle on wheels</span></li>

How can I extract "any vehicle on wheels" in Python with BeautifulSoup or any other module?

There are multiple ways to reach the desired element.

Probably the simplest would be to find it by class:

soup.find('span', class_='def').text
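
For completeness, here is a minimal self-contained sketch of the class lookup. It assumes the bs4 package is installed and uses the fragment from the question as input (with a closing `</ol>` added); the names `html`, `tag` and `definition` are only illustrative.

```python
from bs4 import BeautifulSoup

# HTML fragment from the question (closing </ol> added so the snippet is complete)
html = """
<ol class="sense_list level_1">
     <li class="sense_list_item level_1" value="1"><span class="def">any vehicle on wheels</span></li>
</ol>
"""

# Use the parser bundled with Python to avoid an extra dependency
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches
tag = soup.find("span", class_="def")
definition = tag.get_text(strip=True) if tag is not None else None
print(definition)
```

Checking for None first avoids an AttributeError on pages where the word has no definition.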

or, with a CSS selector:

soup.select('span.def')[0].text

or, additionally checking the parents:

soup.select('ol.level_1 > li.level_1 > span.def')[0].text

or:

soup.select('ol.level_1 > li[value="1"] > span.def')[0].text
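
Note that select(...)[0] raises an IndexError when nothing matches, while select_one() returns None instead. A small sketch comparing the selectors on the fragment from the question, assuming bs4 is installed; the attribute value is quoted because an unquoted value that starts with a digit is not a valid CSS identifier, and the `selectors` and `results` names are just for illustration.

```python
from bs4 import BeautifulSoup

html = """
<ol class="sense_list level_1">
  <li class="sense_list_item level_1" value="1"><span class="def">any vehicle on wheels</span></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# From least to most specific
selectors = [
    "span.def",
    "ol.level_1 > li.level_1 > span.def",
    'ol.level_1 > li[value="1"] > span.def',
]

results = []
for css in selectors:
    match = soup.select_one(css)  # first matching tag, or None
    results.append(match.get_text(strip=True) if match else None)

print(results)
```

The more specific selectors are mainly useful when the page contains other span.def elements that you do not want.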

I solved it with BeautifulSoup:

import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')
q1 = soup.find('li', class_='sense_list_item level_1', value='1').text

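Assuming that is the only HTML you have, you can use NLTK. This works in NLTK 2.x only; in NLTK 3, nltk.clean_html raises NotImplementedError.

import nltk

# htmlstring is assumed to hold the HTML chunk shown in the question
extract = nltk.clean_html(htmlstring)
print(extract)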
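
On NLTK 3 the error raised by clean_html recommends BeautifulSoup's get_text() as the replacement. A minimal sketch of that alternative, assuming bs4 is installed and that `htmlstring` holds the fragment from the question:

```python
from bs4 import BeautifulSoup

# The fragment from the question, on one line (closing </ol> added)
htmlstring = (
    '<ol class="sense_list level_1">'
    '<li class="sense_list_item level_1" value="1">'
    '<span class="def">any vehicle on wheels</span></li></ol>'
)

# get_text() joins all text nodes; strip=True trims whitespace around each one
extract = BeautifulSoup(htmlstring, "html.parser").get_text(strip=True)
print(extract)
```

Like clean_html, this returns all the text in the markup, so it only gives the definition alone when the fragment contains nothing else.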
