How to extract data from HTML markup using Python?

Question

I want to extract the translation of a word from an online dictionary. For example, the HTML code for 'car':

<ol class="sense_list level_1">
     <li class="sense_list_item level_1" value="1"><span class="def">any vehicle on wheels</span></li>

How can I extract "any vehicle on wheels" in Python with BeautifulSoup or any other module?

Answer 1

There are multiple ways to reach the desired element.

Probably the simplest is to find it by class:

soup.find('span', class_='def').text

or, with a CSS selector:

soup.select('span.def')[0].text

or, additionally checking the parent elements:

soup.select('ol.level_1 > li.level_1 > span.def')[0].text

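or, matching on the value attribute (quote the value, since a bare 1 is not a valid CSS identifier and recent BeautifulSoup versions reject it):

soup.select('ol.level_1 > li[value="1"] > span.def')[0].text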
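
The snippets above assume a soup object already exists. A minimal end-to-end sketch, assuming the HTML from the question is held in a string (the closing </ol> tag and the choice of the built-in "html.parser" backend are assumptions, not part of the original question):

```python
from bs4 import BeautifulSoup

# HTML from the question; the closing </ol> is added here for completeness.
html = """
<ol class="sense_list level_1">
     <li class="sense_list_item level_1" value="1"><span class="def">any vehicle on wheels</span></li>
</ol>
"""

# "html.parser" ships with Python, so no extra parser package is needed.
soup = BeautifulSoup(html, "html.parser")

# The four lookups described above; each should reach the same <span>.
by_class = soup.find("span", class_="def").text
by_selector = soup.select("span.def")[0].text
by_parents = soup.select("ol.level_1 > li.level_1 > span.def")[0].text
by_value = soup.select('ol.level_1 > li[value="1"] > span.def')[0].text

print(by_class)
```

Note that find() returns None and select() returns an empty list when nothing matches, so on real pages it is worth checking the result before reading .text.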

Answer 2

I solved it with BeautifulSoup:

import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')
q1 = soup.find('li', class_='sense_list_item level_1', value='1').text
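
Reading .text from the li works here because the item contains only one span. If the dictionary entry lists several senses, the same idea can be extended to a loop. A rough sketch, where the second list item is invented purely for illustration and is not from the original page:

```python
from bs4 import BeautifulSoup

# The second <li> is a made-up entry, added only to demonstrate the loop.
html = """
<ol class="sense_list level_1">
  <li class="sense_list_item level_1" value="1"><span class="def">any vehicle on wheels</span></li>
  <li class="sense_list_item level_1" value="2"><span class="def">a hypothetical second sense</span></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each item's value attribute to the text of its definition span.
definitions = {}
for item in soup.find_all("li", class_="sense_list_item"):
    span = item.find("span", class_="def")
    if span is not None:  # skip items that have no definition span
        definitions[item["value"]] = span.get_text(strip=True)

print(definitions)
```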

Answer 3

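Assuming that is the only HTML code given, you can use NLTK. Note that nltk.clean_html exists only in NLTK 2.x; from NLTK 3.0 onward it raises NotImplementedError and points to BeautifulSoup's get_text() instead.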

import nltk

# Load the HTML chunk into the variable htmlstring first.
extract = nltk.clean_html(htmlstring)
print(extract)
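
For a dependency-free alternative, the standard library's html.parser module can pull out the same text. This is only a sketch: DefExtractor is a hypothetical helper name, the closing </ol> is added for completeness, and the code assumes a definition span never contains another span.

```python
from html.parser import HTMLParser


class DefExtractor(HTMLParser):
    """Collect the text of every <span class="def"> element."""

    def __init__(self):
        super().__init__()
        self.in_def = False    # True while the parser is inside a span.def
        self.definitions = []  # one string per definition found

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "def" in classes:
            self.in_def = True
            self.definitions.append("")

    def handle_endtag(self, tag):
        # Assumes no nested <span> inside a definition.
        if tag == "span":
            self.in_def = False

    def handle_data(self, data):
        if self.in_def:
            self.definitions[-1] += data


htmlstring = (
    '<ol class="sense_list level_1">'
    '<li class="sense_list_item level_1" value="1">'
    '<span class="def">any vehicle on wheels</span></li></ol>'
)

parser = DefExtractor()
parser.feed(htmlstring)
parser.close()
print(parser.definitions)
```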