Downloading a sequence of webpages using Python
I am very new to Python [running 2.7.x] and I am trying to download content from a webpage with thousands of links. Here's my code:
import urllib2

i = 1
limit = 1441
for i in limit:
    url = 'http://pmindia.gov.in/content_print.php?nodeid='+i+'&nodetype=2'
    response = urllib2.urlopen(url)
    webContent = response.read()
    f = open('speech'+i+'.html', 'w')
    f.write(webContent)
    f.close
Fairly elementary, but I get one or both of these errors: 'int object is not iterable' or 'cannot concatenate str and int'. These are the printable versions of the links on this page: http://pmindia.gov.in/all-speeches.php (1400 links). But the node ids go from 1 to 1441, which means 41 numbers are missing (which is a separate problem). One final question: in the long run, while downloading thousands of link objects, is there a way to run them in parallel to increase processing speed?
Try this:
for i in range(1, limit + 1):
    ...
range(M, N) returns a list of numbers from M (inclusive) to N (exclusive). See https://docs.python.org/release/1.5.1p1/tut/range.html
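A quick illustration of the endpoints (with made-up values, not taken from the question):

```python
# range(M, N) includes M but stops just before N
nums = list(range(1, 5))
print(nums)  # [1, 2, 3, 4]

# so range(1, limit + 1) covers 1 through limit inclusive
print(list(range(1, 4)))  # [1, 2, 3]
```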
You may want to consider using Scrapy or some other web crawling framework to help you with this.
There are a couple of mistakes in your code: you iterate over the integer limit instead of a range of numbers, you concatenate an int to a string without converting it first, and f.close is missing its parentheses, so the file is never actually closed.
With those fixes, your code looks like this:
import urllib2

limit = 1441
for i in xrange(1, limit + 1):
    url = 'http://pmindia.gov.in/content_print.php?nodeid=' + repr(i) + '&nodetype=2'
    response = urllib2.urlopen(url)
    webContent = response.read()
    f = open('speech' + repr(i) + '.html', 'w')
    f.write(webContent)
    f.close()  # note the parentheses: f.close alone does nothing
Now, if you want to go into web scraping for real, I suggest you have a look at some packages such as lxml and requests.
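As for the parallel-download question: since downloading is I/O-bound, a thread pool from multiprocessing.dummy is a simple way to keep several requests in flight at once. This is only a sketch; make_url and download are hypothetical helper names, and the actual fetch line is commented out so you can try the pattern without hitting the server:

```python
from multiprocessing.dummy import Pool  # thread-based Pool: threads suit I/O-bound downloads

BASE = 'http://pmindia.gov.in/content_print.php?nodeid=%d&nodetype=2'

def make_url(node_id):
    # Build the print-view URL for one node id (hypothetical helper).
    return BASE % node_id

def download(node_id):
    # One worker's job: fetch one page and save it (same logic as the loop above).
    import urllib2
    page = urllib2.urlopen(make_url(node_id)).read()
    with open('speech%d.html' % node_id, 'w') as f:
        f.write(page)

pool = Pool(8)  # 8 worker threads -> up to 8 downloads in flight
# pool.map(download, range(1, 1442))  # uncomment to fetch node ids 1..1441
urls = pool.map(make_url, [1, 2, 3])  # map works like the builtin map, but in parallel
pool.close()
pool.join()
print(urls[0])
```

Missing node ids will raise urllib2.HTTPError, so in practice you would wrap the urlopen call in a try/except and skip those ids.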