How can I extract data from an HTML tag using Python?

Question

I want to extract the translation of a word from an online dictionary. For example, this is the HTML for 'car':

<ol class="sense_list level_1">
     <li class="sense_list_item level_1" value="1"><span class="def">any vehicle on wheels</span></li>

How can I extract "any vehicle on wheels" in Python with BeautifulSoup or any other module?

Answer 1

There are multiple ways to reach the desired element.

Probably the simplest is to find it by class:

soup.find('span', class_='def').text

or, with a CSS selector:

soup.select('span.def')[0].text

or, additionally checking the parents:

soup.select('ol.level_1 > li.level_1 > span.def')[0].text

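or, matching the li by its value attribute (the attribute value is quoted, because an unquoted value that starts with a digit is not valid CSS):

soup.select('ol.level_1 > li[value="1"] > span.def')[0].text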
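
To try these lookups end to end, here is a minimal self-contained sketch. It assumes the bs4 package is installed and uses Python's built-in 'html.parser'. The variable names (html, by_class and so on) are my own, and the closing </ol> is added only so the sample is well-formed. The attribute value in the last selector is quoted, which the selector engine in recent bs4 releases appears to require.

```python
from bs4 import BeautifulSoup

# Snippet from the question; the closing </ol> is added for well-formedness.
html = """
<ol class="sense_list level_1">
     <li class="sense_list_item level_1" value="1"><span class="def">any vehicle on wheels</span></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

# The four lookups from this answer, each returning the same text.
by_class = soup.find("span", class_="def").text
by_selector = soup.select("span.def")[0].text
by_parents = soup.select("ol.level_1 > li.level_1 > span.def")[0].text
by_value = soup.select('ol.level_1 > li[value="1"] > span.def')[0].text

print(by_class)  # any vehicle on wheels
```

All four variables should hold the same string for this snippet; on a real dictionary page the stricter selectors help avoid picking up an unrelated span.def elsewhere in the document.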

Answer 2

I solved it with BeautifulSoup:

import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')
q1 = soup.find('li', class_='sense_list_item level_1', value='1').text
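
One thing to watch with this approach: find() returns None when nothing matches, so calling .text on the result raises AttributeError for a page without that li. A small sketch of a guarded version follows. The helper name first_definition is hypothetical, and the parser choice ('html.parser') is an assumption.

```python
from bs4 import BeautifulSoup


def first_definition(html):
    """Return the text of the first matching <li>, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # Same lookup as in this answer: exact class string plus value="1".
    item = soup.find("li", class_="sense_list_item level_1", value="1")
    if item is None:
        return None
    # get_text(strip=True) trims surrounding whitespace from the result.
    return item.get_text(strip=True)


sample = (
    '<ol class="sense_list level_1">'
    '<li class="sense_list_item level_1" value="1">'
    '<span class="def">any vehicle on wheels</span></li></ol>'
)
print(first_definition(sample))
```

Note that this takes the text of the whole li, so if the li ever contains more than the span.def, that extra text is included too.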

Answer 3

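Assuming that is the only HTML you have, you can use NLTK. This works only on NLTK 2.x: clean_html() was removed in NLTK 3, where calling it raises NotImplementedError and points to BeautifulSoup's get_text() instead.

import nltk

# htmlstring holds the HTML chunk from the question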
extract = nltk.clean_html(htmlstring)
print(extract)
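
On current NLTK releases, a rough equivalent of the snippet above is BeautifulSoup's get_text(), which strips all tags and returns the remaining text. This is a sketch that swaps in bs4 for nltk.clean_html, not the same function; it assumes bs4 is installed, and the htmlstring value is the question's snippet with a closing </ol> added.

```python
from bs4 import BeautifulSoup

# HTML chunk from the question (closing </ol> added for well-formedness).
htmlstring = (
    '<ol class="sense_list level_1">'
    '<li class="sense_list_item level_1" value="1">'
    '<span class="def">any vehicle on wheels</span></li></ol>'
)

# get_text() drops every tag; strip=True trims whitespace around the text.
extract = BeautifulSoup(htmlstring, "html.parser").get_text(strip=True)
print(extract)  # any vehicle on wheels
```

Like clean_html, this returns all text in the chunk, so it is only suitable when the chunk contains nothing but the definition.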