Python web scraping with BeautifulSoup: how do I loop through complex URLs?

Question

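I am trying to scrape the Florida Statutes from the following website: www.leg.state.fl.us/Statutes/

So far I have only been able to scrape the first chapter: http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0001/0001.html

I noticed that the URL changes to "URL=0000-0099/0002/0002.html" when I move on to the next chapter. My question is: how can I write the code so that it scrapes all of the chapters? (The first part of the URL, 0000-0099, is the range of chapters, so in this case it covers chapters 1 through 99.)

My code is as follows: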

from bs4 import BeautifulSoup
import urllib2

# Raw string so the backslashes in the Windows path are taken literally
f = open(r'C:\Python27\projects\outflieFS_final.txt', 'w')

def First_part(url):
    # Download the page and parse it with BeautifulSoup
    thepage = urllib2.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

soup = First_part("http://www.leg.state.fl.us/statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0001/0001.html")

# The statute text sits inside <div id="statutes">
tableContents = soup.find('div', {'id': 'statutes'})

# Write the text of every <div> inside it, separated by blank lines
for data in tableContents.findAll('div'):
    data = data.text.encode("utf-8", "ignore")
    f.write(data + "\n\n")
f.close()

Answer 1

Loop over the chapter numbers and use string formatting to build each URL:

base_url = "http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/00{chapter:02d}/00{chapter:02d}.html"
for chapter in range(1, 100):
    url = base_url.format(chapter=chapter)
    print(url)
    # make a request and parse the page

This produces the following URLs:

http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0001/0001.html
http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0002/0002.html
...
http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0098/0098.html
http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0099/0099.html
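To fill in the "make a request and parse the page" step, here is a rough sketch that combines the loop with the extraction logic from the question. It is written for Python 3 (`urllib.request` instead of `urllib2`). The function names `chapter_url`, `statute_text` and `scrape_chapters` and the output path are made up for illustration. The range prefix for chapters above 99 is an assumption: the question only confirms "0000-0099", and this sketch guesses that later blocks of 100 chapters follow the same pattern. It also assumes some chapter numbers may not exist, so failed requests are skipped rather than treated as fatal.

```python
from urllib.request import urlopen

SITE = "http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL="


def chapter_url(chapter):
    """Build the URL for one chapter number.

    Assumption: every block of 100 chapters follows the "0000-0099" pattern
    from the question (so chapter 100 would fall under "0100-0199").
    """
    start = chapter // 100 * 100
    return "{site}{start:04d}-{end:04d}/{ch:04d}/{ch:04d}.html".format(
        site=SITE, start=start, end=start + 99, ch=chapter
    )


def statute_text(html):
    """Return the text of every <div> inside <div id="statutes">, as in the question."""
    # Imported here so chapter_url() can be used even without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"id": "statutes"})
    if container is None:
        # The page has no statutes block, so there is nothing to extract.
        return ""
    return "\n\n".join(div.get_text() for div in container.find_all("div"))


def scrape_chapters(chapters, out_path):
    """Fetch each chapter and append its text to out_path, skipping failed requests."""
    with open(out_path, "w", encoding="utf-8") as out:
        for chapter in chapters:
            url = chapter_url(chapter)
            try:
                with urlopen(url, timeout=30) as page:
                    html = page.read()
            except OSError as exc:
                # URLError, HTTPError and socket timeouts are all OSError subclasses.
                print("skipping", url, exc)
                continue
            out.write(statute_text(html) + "\n\n")
```

Calling `scrape_chapters(range(1, 100), "statutes.txt")` would walk chapters 1 through 99. For a polite crawl you would probably also want a short `time.sleep()` between requests.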

Answer 1
0 Accepted 2016-03-21 02:08:24
