Extracting data from HTML using BeautifulSoup in Python

I am trying to extract data from a website using BeautifulSoup. I want to extract data from this HTML snippet:

<ul class="result-info info-bro-6 cc" style="display: block;">
  <li>
    <strong>MODERATED</strong>
    <ul class="cc">
      <li> Health </li>
      <li> C**** </li>
      <li> C******* </li>
      <li> D**** </li>
      <li> Di8**** </li>
      <li> Di**** </li>
      <li> F******* </li>
      <li> Fi****** </li>
      <li> L****** </li>
      <li> M**** </li>
      <li> NM***** </li>
      <li> P****** </li>
      <li> Pr***** </li>
      <li> Sp**** </li>
      <li> *******e </li>
    </ul>
  </li>
  <li>
    <strong> ********* </strong>
    <ul class="cc">
      <li>*** /****</li>
    </ul>
  </li>
</ul>

The data I want to extract is "*** /****". I want my code to return this and only this, but the code I currently have returns all the data within the li tags. How can I extract only the data I want?

This is my current code:

from bs4 import BeautifulSoup
import requests

html = """<ul class="result-info info-bro-6 cc" style="display: block;">
  <li>
    <strong>H*******</strong>
    <ul class="cc">
      <li> H***** </li>
      <li> C**** </li>
      <li> C******* </li>
      <li> D**** </li>
      <li> Di***** </li>
      <li> Di**** </li>
      <li> F******* </li>
      <li> Fi****** </li>
      <li> L****** </li>
      <li> M**** </li>
      <li> NM***** </li>
      <li> P****** </li>
      <li> Pr***** </li>
      <li> Sp**** </li>
      <li> *******e </li>
    </ul>
  </li>
  <li>
    <strong> ********* </strong>
    <ul class="cc">
      <li>*** /****</li>
    </ul>
  </li>
</ul>"""

soup = BeautifulSoup(html)

for ultag in soup.find_all('ul', {'class': 'cc'}):
    for litag in ultag.find_all('li'):
        print(litag.text)

As you've noticed, there are several ul tags with class=cc, and your loop visits every one of them. You'll need to find a consistent pattern in your HTML that lets you grab the one you want and only that one.

For example, the ul tag you want is the last one in your HTML. So instead of iterating through all the ul tags, just get the last one:

ultag = soup.find_all('ul', {'class':'cc'})[-1]
litag = ultag.li
print(litag.text)

Unfortunately, if this doesn't work because there are more ul tags later in your HTML, you'll need to make your navigation more specific.
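
One possible way to make the navigation more specific is a CSS selector that spells out the nesting. This is only a sketch: it assumes the wanted list always sits inside the last li of the outer ul.result-info, and the shortened html string and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the HTML in the question (assumed structure),
# plus an unrelated ul.cc afterwards to show that it is ignored.
html = """
<ul class="result-info info-bro-6 cc">
  <li><strong>A</strong>
    <ul class="cc"><li>one</li><li>two</li></ul>
  </li>
  <li><strong>B</strong>
    <ul class="cc"><li>*** /****</li></ul>
  </li>
</ul>
<ul class="cc"><li>unrelated later list</li></ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Child combinators (>) pin down the exact path:
# outer ul -> its last direct li -> that li's ul.cc -> its li items.
items = soup.select("ul.result-info > li:last-child > ul.cc > li")
wanted = [li.get_text(strip=True) for li in items]
print(wanted)
```

Because the selector starts from ul.result-info, a ul with class cc that appears later in the page does not affect the result.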


If it is the last ul nested inside the ul with the classes result-info info-bro-6 cc, then perhaps this will help:

outer_ul = soup.select_one('ul.result-info.info-bro-6.cc')
last_ul = outer_ul.find_all('ul')[-1]
print(last_ul.text)
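
A slightly more defensive variant of the same idea might look like the sketch below. select_one returns None when nothing matches, so the hypothetical helper last_nested_list_text (not part of BeautifulSoup) checks for that, and get_text(strip=True) drops the surrounding whitespace. The shortened html string is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the HTML in the question (assumed structure).
html = """
<ul class="result-info info-bro-6 cc">
  <li><strong>A</strong> <ul class="cc"> <li> one </li> </ul> </li>
  <li><strong>B</strong> <ul class="cc"> <li>*** /****</li> </ul> </li>
</ul>
"""


def last_nested_list_text(soup):
    """Hypothetical helper: text of the last ul nested in ul.result-info."""
    outer_ul = soup.select_one("ul.result-info.info-bro-6.cc")
    if outer_ul is None:
        return None  # the outer list is not on the page
    nested = outer_ul.find_all("ul")
    if not nested:
        return None  # the outer list has no nested lists
    return nested[-1].get_text(strip=True)


result = last_nested_list_text(BeautifulSoup(html, "html.parser"))
missing = last_nested_list_text(BeautifulSoup("<p>no lists</p>", "html.parser"))
print(result, missing)
```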

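You can use next to step forward from that tag. In bs4, next behaves like next_element: it moves to the next node in document order (here the whitespace after the opening ul tag, then the li), not to the next sibling: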

soup = BeautifulSoup(html, 'html.parser')
data = soup.findAll('ul', attrs={'class':'cc'})[2].next.next.text
print(data)
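
To see why two hops are needed, the sketch below walks the same path with the documented next_element attribute and prints each node along the way. The one-line html string and the variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the last nested list in the question.
html = '<ul class="cc"> <li>*** /****</li> </ul>'
soup = BeautifulSoup(html, "html.parser")

ul = soup.find("ul", class_="cc")
first = ul.next_element      # whitespace text node right after <ul class="cc">
second = first.next_element  # the <li> tag that follows the whitespace

print(repr(first))
print(second.name, second.get_text(strip=True))

# ul.find("li") reaches the same tag without counting text nodes.
same_li = ul.find("li")
print(same_li is second)
```

The number of hops depends on the whitespace in the markup, so looking the li up by name (ul.find('li') or ul.li) is likely to be less fragile.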
