How can I extract data from an HTML tag using Python?

I want to extract the translation of a word from an online dictionary. For example, the HTML code for 'car':

<ol class="sense_list level_1">
     <li class="sense_list_item level_1" value="1"><span class="def">any vehicle on wheels</span></li>

How can I extract "any vehicle on wheels" in Python with BeautifulSoup or any other module?

There are multiple ways to reach the desired element.

Probably the simplest is to find it by class:

soup.find('span', class_='def').text
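
For completeness, here is a minimal self-contained sketch of the class lookup. It assumes the bs4 package is installed and that the question's snippet is held in a string; the variable name html is an assumption, and the closing </ol> is added only so the fragment is well-formed.

```python
from bs4 import BeautifulSoup

# The snippet from the question, stored in a string (variable name is illustrative).
html = """
<ol class="sense_list level_1">
     <li class="sense_list_item level_1" value="1"><span class="def">any vehicle on wheels</span></li>
</ol>
"""

# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = BeautifulSoup(html, "html.parser")

# find() returns None when nothing matches, so guard before reading .text.
span = soup.find("span", class_="def")
definition = span.text if span is not None else None
print(definition)
```

The None check matters on real dictionary pages, where a word may have no entry and the span may be missing.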

or, with a CSS selector:

soup.select('span.def')[0].text

or, additionally checking the parents:

soup.select('ol.level_1 > li.level_1 > span.def')[0].text

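or, matching the list item by its value attribute (quoted, because an unquoted value that starts with a digit is not a valid CSS identifier and newer BeautifulSoup versions may reject it):

soup.select('ol.level_1 > li[value="1"] > span.def')[0].text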
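
To compare the selector variants side by side, the following sketch runs each one against the question's snippet. It assumes bs4 is installed; the html and selectors names are illustrative, and select_one() is used in place of select(...)[0] so that a non-matching selector yields None rather than an IndexError. The attribute value is written in quotes, which is the form CSS parsers accept reliably.

```python
from bs4 import BeautifulSoup

# The snippet from the question (closing </ol> added so the fragment is well-formed).
html = """
<ol class="sense_list level_1">
     <li class="sense_list_item level_1" value="1"><span class="def">any vehicle on wheels</span></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# From least to most specific; ">" means "direct child of".
selectors = [
    "span.def",
    "ol.level_1 > li.level_1 > span.def",
    'ol.level_1 > li[value="1"] > span.def',
]

results = []
for selector in selectors:
    node = soup.select_one(selector)  # first match, or None
    results.append(node.get_text(strip=True) if node else None)

print(results)
```

The more specific selectors are mainly useful when the page contains other span.def elements that should not match.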

I solved it with BeautifulSoup:

import bs4

# html holds the page source; name the parser explicitly to avoid a warning
soup = bs4.BeautifulSoup(html, 'html.parser')
q1 = soup.find('li', class_="sense_list_item level_1", value='1').text
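
A dictionary entry will often list several senses, in which case find_all() collects every definition rather than only the first. The sketch below assumes bs4 is installed; the second list item is invented purely for illustration and does not come from the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second <li> is made up to show multiple senses.
html = """
<ol class="sense_list level_1">
  <li class="sense_list_item level_1" value="1"><span class="def">any vehicle on wheels</span></li>
  <li class="sense_list_item level_1" value="2"><span class="def">a railway carriage</span></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every matching tag in document order.
definitions = [
    span.get_text(strip=True)
    for span in soup.find_all("span", class_="def")
]
print(definitions)
```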

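Assuming that is the only HTML given, you can use NLTK. Note that clean_html is available only in NLTK 2.x; from NLTK 3.0 onward it raises NotImplementedError.

import nltk

# Load the HTML chunk into the variable htmlstring first (NLTK 2.x only)
extract = nltk.clean_html(htmlstring)
print(extract)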
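
On NLTK 3.x, the error raised by clean_html recommends BeautifulSoup's get_text() instead. A minimal sketch of that replacement follows, assuming bs4 is installed; the htmlstring value here is simply the question's snippet on one line.

```python
from bs4 import BeautifulSoup

# The HTML chunk from the question, as a single-line string.
htmlstring = (
    '<ol class="sense_list level_1">'
    '<li class="sense_list_item level_1" value="1">'
    '<span class="def">any vehicle on wheels</span></li></ol>'
)

# get_text() drops all markup; strip=True trims surrounding whitespace.
extract = BeautifulSoup(htmlstring, "html.parser").get_text(strip=True)
print(extract)
```

Like clean_html, this returns all the text in the chunk, so it only isolates the definition when the chunk contains nothing else.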
