Python Beautiful Soup: retrieving HTML content inside an element id
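I am trying to scrape an HTML page:
from bs4 import BeautifulSoup
productsoup = BeautifulSoup(productdriver.page_source, "lxml")
The HTML this Python script gets back includes the following section, identified by its element id: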
<div style="padding-top: 10px;" id="government_funding"> <h2>Sampling of Recent Funding Actions/Set Asides</h2> <p style="font-style: italic; font-size: .8em;">In order by amount of set aside monies.</p> <ul> <li><span style="color: green;">$14,450</span> - Thursday the 17th of August 2017<br><span style="font-weight: bold; font-size: 1.2em;">National Institutes Of Health</span> <br> NATIONAL INSTITUTES OF HEALTH NICHD<br>AVANTI POLAR LIPIDS:1109394 [17-010744] <hr> </li> <li><span style="color: green;">$5,455</span> - Thursday the 31st of August 2017<br><span style="font-weight: bold; font-size: 1.2em;">National Institutes Of Health</span> <br> NATIONAL INSTITUTES OF HEALTH NICHD<br>AVANTI POLAR LIPIDS:1109394 [17-004567] <hr> </li> <li><span style="color: green;">$5,005</span> - Tuesday the 8th of August 2017<br><span style="font-weight: bold; font-size: 1.2em;">National Institutes Of Health</span> <br> NATIONAL INSTITUTES OF HEALTH NIAID<br>CUSTOM LIPID SYNTHESIS (24:0-10:0 PE) 100 MG PACKAGED IN 10-10MG VIALS POWDER PER QUOTE #DQ-000665 <hr> </li> <li><span style="color: green;">$5,005</span> - Thursday the 17th of August 2017<br><span style="font-weight: bold; font-size: 1.2em;">National Institutes Of Health</span> <br> NATIONAL INSTITUTES OF HEALTH NIAID<br>CUSTOM LIPID SYNTHESIS (24:0-10:0 PE) 100 MG PACKAGED IN 10-10MG VIALS POWDER PER QUOTE #DQ-000665 <hr> </li> </ul> </div>
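This is only part of the HTML; the section is identified by id="government_funding". I want to print the price, date, and agency for each li inside id="government_funding". So the output for one li would be:
Price = $14,450
Date = 17th of August 2017
Agency = National Institutes Of Health
Sub-agency = NATIONAL INSTITUTES OF HEALTH NICHD
How can I code the output above?
The data source is https://www.collierreporting.com/company/avanti-polar-lipids-inc-alabaster-al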
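You can iterate over the li tags and their span values, and use re.findall to pull the date out of the surrounding text:
import re

def all_data(d):
    # Each <li> holds exactly two <span> tags: the price and the agency name.
    a, b = [i.text for i in d.find_all('span')]
    # The date is plain text between the spans, e.g. "Thursday the 17th of August 2017".
    return [a, *re.findall(r'\w+\sthe\s\w+\sof\s\w+\s\d+', d.text), b]

results = [all_data(b) for b in productsoup.find('div', {'id': 'government_funding'}).find_all('li')]
Output: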
[['$14,450', 'Thursday the 17th of August 2017', 'National Institutes Of Health'], ['$5,455', 'Thursday the 31st of August 2017', 'National Institutes Of Health'], ['$5,005', 'Tuesday the 8th of August 2017', 'National Institutes Of Health'], ['$5,005', 'Thursday the 17th of August 2017', 'National Institutes Of Health']]
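The result above has the price, date, and agency, but not the sub-agency the question also asked for. Here is a minimal sketch of one way to get all four fields, assuming every li keeps the same order of text pieces as in the posted HTML (price, date line, agency, sub-agency, description). The names `SAMPLE_HTML`, `DATE_RE`, `parse_funding_item`, and `parse_funding` are made up for illustration, and the sample is a trimmed copy of two items from the question, parsed with the built-in "html.parser" so the snippet does not depend on lxml.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of two <li> items from the question, for illustration only.
SAMPLE_HTML = """
<div style="padding-top: 10px;" id="government_funding">
<ul>
<li><span style="color: green;">$14,450</span> - Thursday the 17th of August 2017<br><span style="font-weight: bold; font-size: 1.2em;">National Institutes Of Health</span> <br> NATIONAL INSTITUTES OF HEALTH NICHD<br>AVANTI POLAR LIPIDS:1109394 [17-010744] <hr> </li>
<li><span style="color: green;">$5,005</span> - Tuesday the 8th of August 2017<br><span style="font-weight: bold; font-size: 1.2em;">National Institutes Of Health</span> <br> NATIONAL INSTITUTES OF HEALTH NIAID<br>CUSTOM LIPID SYNTHESIS (24:0-10:0 PE) 100 MG PACKAGED IN 10-10MG VIALS POWDER PER QUOTE #DQ-000665 <hr> </li>
</ul>
</div>
"""

# Matches the date without the weekday, e.g. "17th of August 2017".
DATE_RE = re.compile(r"\d+(?:st|nd|rd|th) of \w+ \d{4}")


def parse_funding_item(li):
    """Return one <li> as a dict; assumes the text order seen in the question."""
    # stripped_strings yields every non-empty text piece, in document order.
    parts = list(li.stripped_strings)
    date_match = DATE_RE.search(parts[1])
    return {
        "price": parts[0],
        "date": date_match.group(0) if date_match else None,
        "agency": parts[2],
        "sub_agency": parts[3],
    }


def parse_funding(soup):
    """Parse every <li> under the div with id="government_funding"."""
    container = soup.find("div", id="government_funding")
    if container is None:
        return []
    return [parse_funding_item(li) for li in container.find_all("li")]


rows = parse_funding(BeautifulSoup(SAMPLE_HTML, "html.parser"))
for row in rows:
    print("Price =", row["price"])
    print("Date =", row["date"])
    print("Agency =", row["agency"])
    print("Sub-agency =", row["sub_agency"])
```

With the real page you would pass `productsoup` to `parse_funding` instead of the sample. Indexing by position is fragile: if an item is missing its agency span or sub-agency line, the fields will shift, so it may be worth checking `len(parts)` before trusting the result.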