Determining the number of documents on a website in Python

Question

I have the following link:

http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-0001&language=EN

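The reference part of the URL carries the following information:

A7 == parliament (the current one is the seventh parliament, the previous one was A6, and so on)

2010 == year

0001 == document number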
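
To make the format concrete, here is a minimal sketch of how such a reference could be composed and split apart. The helper names `make_reference` and `parse_reference` are hypothetical (they come from no library), and the pattern assumes the prefix is always the letter A followed by the parliament number.

```python
import re

# Assumed pattern: "A<parliament>-<4-digit year>-<4-digit document number>".
REFERENCE_RE = re.compile(r"^A(\d+)-(\d{4})-(\d{4})$")


def make_reference(parliament, year, number):
    """Compose a reference such as 'A7-2010-0001'."""
    return "A%d-%d-%04d" % (parliament, year, number)


def parse_reference(reference):
    """Split a reference into (parliament, year, document number)."""
    match = REFERENCE_RE.match(reference)
    if match is None:
        raise ValueError("unexpected reference: %r" % reference)
    parliament, year, number = match.groups()
    return int(parliament), int(year), int(number)
```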

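For each year and parliament, I would like to determine the number of documents on the website. The task is complicated by the fact that, for 2010 for example, the pages numbered 186, 195 and 196 are empty, while the highest number is 214. Ideally the output should be a vector containing all document numbers, excluding the missing ones.

Can anyone tell me whether this is possible in Python?

Best, Thomas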

Answer 1

First, make sure that scraping their website is legal.

Second, note that when a document does not exist, the HTML file contains:

<title>Application Error</title>
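
That check can be wrapped in a small helper. This is only a sketch: `is_missing` is a hypothetical name, and matching case-insensitively is an assumption on my part, in case the capitalisation of the title varies.

```python
# Marker taken from the error page's <title>, lower-cased for comparison.
MISSING_MARKER = "<title>application error</title>"


def is_missing(html):
    """Return True if the page text looks like the 'Application Error' page."""
    return MISSING_MARKER in html.lower()
```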

Third, use urllib to iterate over everything you want:

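import urllib.error
import urllib.request
ROOT = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A%d-%d-%04d&language=EN"
for p in range(1, 8):            # parliaments A1 to A7
    for y in range(2000, 2011):  # years 2000 to 2010
        doc = 1
        while True:
            try:  # use urllib to open the url: (root)+p+y+doc
                html = urllib.request.urlopen(ROOT % (p, y, doc)).read()
            except urllib.error.HTTPError as err:
                html = err.read()  # an error status still carries the page body
            if b"application error" in html.lower():
                break  # missing document: leave the while loop
            doc += 1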
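
Since the question mentions gaps (186, 195 and 196 are empty in 2010 although the numbers go up to 214), stopping at the first missing document would end the search too early. One possible variation, sketched below, gives up only after several consecutive misses. The name `scan_documents`, the `exists` callback and the `max_gap` default of 20 are all assumptions for illustration; `exists` stands in for whatever urllib-based check you end up using.

```python
def scan_documents(exists, max_gap=20, start=1):
    """Collect the document numbers for which exists(number) is true.

    Scanning stops after max_gap consecutive missing numbers, so
    isolated gaps do not end the search early.
    """
    found = []
    misses = 0
    number = start
    while misses < max_gap:
        if exists(number):
            found.append(number)
            misses = 0
        else:
            misses += 1
        number += 1
    return found


# Offline demonstration with made-up availability data shaped like the
# 2010 example from the question (no network access involved).
available = set(range(1, 215)) - {186, 195, 196}
numbers = scan_documents(lambda n: n in available)
```

A gap longer than `max_gap` would still cut the scan short, so the value is a trade-off between wasted requests and the risk of missing late documents.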

Answer 2

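Here is a somewhat more complete (but hacky) example that seems to work (using httplib2). I'm sure you can customise it for your specific needs.

I would also repeat Arrieta's warning about making sure the site owner doesn't mind you scraping its content.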

#!/usr/bin/env python3
import httplib2

# Cache responses on disk in the ".cache" directory.
h = httplib2.Http(".cache")

parliament = "A7"
year = 2010

# Create two lists, one list of URLs and one list of document numbers.
urllist = []
doclist = []

urltemplate = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=%s-%d-%04d&language=EN"

# Try every four-digit document number, 0001 to 9999.
for document in range(1, 10000):
    url = urltemplate % (parliament, year, document)
    resp, content = h.request(url, "GET")
    # content is bytes; the error page marks a missing document.
    if b"Application Error" not in content:
        print("Document %04d exists" % document)
        urllist.append(url)
        doclist.append(document)
    else:
        print("Document %04d doesn't exist" % document)
print("Parliament %s, year %d has %d documents" % (parliament, year, len(doclist)))
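
Once `doclist` is filled, the gaps can be derived from it. A small sketch, assuming `doclist` holds the existing document numbers as integers; `missing_numbers` is a hypothetical helper name, and the sample list is made up to mirror the 2010 example from the question.

```python
def missing_numbers(doclist):
    """Return the numbers between 1 and max(doclist) that are absent."""
    if not doclist:
        return []
    present = set(doclist)
    return [n for n in range(1, max(doclist) + 1) if n not in present]


# Made-up sample shaped like the 2010 case: 1 to 214 without three numbers.
sample_doclist = [n for n in range(1, 215) if n not in (186, 195, 196)]
gaps = missing_numbers(sample_doclist)
```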

Answer 3

Here is a solution, but it is a good idea to add a pause between requests:

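import urllib.error
import urllib.request

URL_TEMPLATE = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-%d-%.4d&language=EN"
MAX_RANGE = 300

for year in [2010, 2011]:
    for page in range(1, MAX_RANGE):
        try:
            with urllib.request.urlopen(URL_TEMPLATE % (year, page)) as f:
                text = f.read().decode("utf-8", errors="replace")
        except urllib.error.HTTPError as err:
            # urlopen raises on HTTP error statuses; the error page is still readable.
            text = err.read().decode("utf-8", errors="replace")
        if "<title>Application Error</title>" in text:
            print("year %d and page %.4d NOT found" % (year, page))
        else:
            print("year %d and page %.4d FOUND" % (year, page))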
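
The pause between requests could look roughly like this. It is a sketch under assumptions: `fetch_politely` and its parameters are names I made up, the one-second default is arbitrary, and `fetch` is any callable that takes a URL and returns the page text (for instance a thin wrapper around urllib).

```python
import time


def fetch_politely(urls, fetch, delay=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        pages[url] = fetch(url)
    return pages
```

Passing `sleep` as a parameter is only there so the function can be tried out without actually waiting; in normal use the default `time.sleep` applies.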


Answer 1: score 3, posted 2010-07-09 05:45:12

Answer 2: score 1, posted 2010-07-09 06:13:39

Answer 3: score 1, accepted, posted 2010-07-09 06:18:30
