Python web scraping using BeautifulSoup, how to loop through complicated URL?
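So I'm trying to scrape the Florida Statutes from the following site: www.leg.state.fl.us/Statutes/
So far I've only been able to scrape the first chapter: http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0001/0001.html
I noticed that the URL changes to "URL=0000-0099/0002/0002.html" when I move on to the next chapter. My question is: how can I write this so that it scrapes all of the chapters? (The first part of the URL, 0000-0099, is the range of chapters, so in this case it covers chapter 1 through chapter 99.)
My code is below: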
from bs4 import BeautifulSoup
import urllib2

# Raw string so the backslashes in the Windows path are taken literally
f = open(r'C:\Python27\projects\outflieFS_final.txt', 'w')

def First_part(url):
    # Download the page and parse it with the built-in HTML parser
    thepage = urllib2.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

soup = First_part("http://www.leg.state.fl.us/statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0001/0001.html")
tableContents = soup.find('div', {'id': 'statutes'})

# Write the text of every <div> inside the statutes container
for data in tableContents.findAll('div'):
    data = data.text.encode("utf-8", "ignore")
    data = str(data) + "\n\n"
    f.write(data)

f.close()
Make a loop and use string formatting to build the URL:
base_url = "http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/00{chapter:02d}/00{chapter:02d}.html"

for chapter in range(1, 100):
    url = base_url.format(chapter=chapter)
    print(url)
    # make a request and parse the page
This would produce the following URLs:
http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0001/0001.html
http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0002/0002.html
...
http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0098/0098.html
http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0099/0099.html
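To fill in the "make a request and parse the page" step, here is a minimal sketch that combines the loop with the parsing from the question. It is written for Python 3 (`urllib.request` instead of `urllib2`), which is an assumption, since the question's code targets Python 2. The helper names `chapter_url`, `fetch`, `extract_sections` and `scrape_chapters`, the one-second delay and the output path are all made up for illustration. It also assumes a missing chapter shows up either as an HTTP error or as a page without the `statutes` div; I have not checked how the site actually responds, so both cases are simply skipped.

```python
import time
import urllib.error
import urllib.request

from bs4 import BeautifulSoup

# Same pattern as above; {chapter:04d} pads to four digits,
# which matches 00{chapter:02d} for chapters 1 through 99.
BASE_URL = (
    "http://www.leg.state.fl.us/Statutes/index.cfm"
    "?App_mode=Display_Statute&URL=0000-0099/{chapter:04d}/{chapter:04d}.html"
)


def chapter_url(chapter):
    """Build the URL for one chapter in the 0000-0099 range."""
    return BASE_URL.format(chapter=chapter)


def fetch(url):
    """Return the raw page body, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.read()
    except urllib.error.URLError:
        # HTTPError is a subclass of URLError, so this covers both.
        return None


def extract_sections(html):
    """Return the text of every <div> inside <div id="statutes">."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"id": "statutes"})
    if container is None:
        # Assumption: no statutes container means there is no such chapter.
        return []
    return [div.get_text() for div in container.find_all("div")]


def scrape_chapters(chapters, out_path):
    """Write the text of each available chapter to out_path."""
    with open(out_path, "w", encoding="utf-8") as out:
        for chapter in chapters:
            html = fetch(chapter_url(chapter))
            if html is None:
                continue  # request failed, skip this chapter
            sections = extract_sections(html)
            if not sections:
                continue  # nothing to write, skip this chapter
            for text in sections:
                out.write(text + "\n\n")
            time.sleep(1)  # small pause between requests
```

Calling `scrape_chapters(range(1, 100), "statutes.txt")` would then walk chapters 1 through 99 and write whatever it finds to one file. Opening the file in text mode with `encoding="utf-8"` takes the place of the manual `.encode("utf-8", "ignore")` call in the question.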