Determining number of sites on a website in python

I have the following link:

http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-0001&language=EN

The reference part of the URL encodes the following information:

A7 == the parliament (the current one is the seventh; the previous one is A6, and so forth)

2010 == year

0001 == document number
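Putting those three pieces together, the reference string can be assembled like so (a minimal sketch; make_reference is just an illustrative helper name, not anything the site provides):

# Illustrative helper: combine parliament, year and a zero-padded
# document number into the reference part of the URL.
def make_reference(parliament, year, doc):
    return "%s-%d-%04d" % (parliament, year, doc)

print make_reference("A7", 2010, 1)   # -> A7-2010-0001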

For every year and parliament I would like to identify the number of documents on the website. The task is complicated by the fact that, for 2010 for instance, numbers 186, 195 and 196 have empty pages, while the maximum number is 214. Ideally the output should be a vector with all the document numbers, excluding the missing ones.

Can anyone tell me if this is possible in Python?

Best, Thomas

First, make sure that scraping their site is legal.

Second, notice that when a document is not present, the HTML file contains:

<title>Application Error</title>
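A minimal test for that marker might look like this (a sketch; html stands for the downloaded page text):

def document_exists(html):
    # pages for missing documents are served with the "Application Error" title
    return "<title>Application Error</title>" not in html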

Third, use urllib to iterate over all the combinations you want:

import urllib

# (root) is the fixed part of the URL; the reference is (parliament)-(year)-(document)
root = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A%d-%d-%04d&language=EN"

for p in range(1, 8):            # parliaments A1 to A7
    for y in range(2000, 2011):
        doc = 1
        while True:
            html = urllib.urlopen(root % (p, y, doc)).read()
            # a missing document is served as an "Application Error" page
            if "Application Error" in html:
                break
            doc += 1

Here's a slightly more complete (but hacky) example which seems to work, using httplib2 - I'm sure you can customise it for your specific needs.

I'd also repeat Arrieta's warning about making sure the site's owner doesn't mind you scraping its content.

#!/usr/bin/env python
import httplib2
h = httplib2.Http(".cache")   # cache responses in a local ".cache" directory

parliament = "A7"
year = 2010

# Create two lists, one list of URLs and one list of document numbers.
urllist = []
doclist = []

urltemplate = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=%s-%d-%04u&language=EN"

for document in range(1, 10000):   # document numbers run from 0001 to 9999
    url = urltemplate % (parliament, year, document)
    resp, content = h.request(url, "GET")
    if content.find("Application Error") == -1:
        print "Document %04u exists" % (document)
        urllist.append(url)
        doclist.append(document)
    else:
        print "Document %04u doesn't exist" % (document)
print "Parliament %s, year %u has %u documents" % (parliament, year, len(doclist))

Here is a solution, but adding some delay between requests is a good idea:

import time
import urllib

URL_TEMPLATE = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-%d-%.4d&language=EN"
maxRange = 300   # upper bound on document numbers to try

for year in [2010, 2011]:
    for page in range(1, maxRange):
        f = urllib.urlopen(URL_TEMPLATE % (year, page))
        text = f.read()
        if "<title>Application Error</title>" in text:
            print "year %d and page %.4d NOT found" % (year, page)
        else:
            print "year %d and page %.4d FOUND" % (year, page)
        f.close()
        time.sleep(1)   # polite pause between requests, as suggested above
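To get the "vector" of existing document numbers that the question asks for, the same loop can collect hits into a list instead of only printing them (a sketch that continues the snippet above, reusing its urllib/time imports, URL_TEMPLATE and maxRange):

found = {}   # year -> list of document numbers that exist
for year in [2010, 2011]:
    found[year] = []
    for page in range(1, maxRange):
        f = urllib.urlopen(URL_TEMPLATE % (year, page))
        if "<title>Application Error</title>" not in f.read():
            found[year].append(page)
        f.close()
        time.sleep(1)   # same polite pause as above

print found[2010]   # e.g. all numbers up to 214, minus the empty 186, 195, 196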
