
Downloading a sequence of webpages using Python

I am very new to Python [running 2.7.x] and I am trying to download content from a webpage with thousands of links. Here's my code:

import urllib2
i = 1
limit = 1441

for i in limit: 
    url = 'http://pmindia.gov.in/content_print.php?nodeid='+i+'&nodetype=2'
    response = urllib2.urlopen(url)
    webContent = response.read()
    f = open('speech'+i+'.html', 'w')
    f.write(webContent)
    f.close

Fairly elementary, but I get one or both of these errors: 'int object is not iterable' or 'cannot concatenate str and int'. These are the printable versions of the links on this page: http://pmindia.gov.in/all-speeches.php (1400 links). But the node ids go from 1 to 1441, which means 41 numbers are missing (which is a separate problem). One final question: in the long run, while downloading thousands of link objects, is there a way to run them in parallel to increase processing speed?

Try this:

for i in range(1, limit + 1):
...

range(M, N) returns a list of numbers from M (inclusive) to N (exclusive). See https://docs.python.org/release/1.5.1p1/tut/range.html
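For example, in Python 2:

>>> range(1, 5)
[1, 2, 3, 4]

(xrange behaves the same way but produces the numbers lazily instead of building the whole list, which matters for very large ranges.)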

You might want to consider using Scrapy or some other web crawling framework to help you with this.
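For illustration, here is a minimal sketch of what a Scrapy spider for this task might look like; the spider name and file-naming scheme are just placeholders, not anything from the question:

import scrapy

class SpeechSpider(scrapy.Spider):
    name = 'speeches'  # hypothetical spider name
    # one print-view URL per node id, 1 through 1441
    start_urls = ['http://pmindia.gov.in/content_print.php?nodeid=%d&nodetype=2' % i
                  for i in range(1, 1442)]

    def parse(self, response):
        # recover the node id from the URL and save the raw page under it
        nodeid = response.url.split('nodeid=')[1].split('&')[0]
        with open('speech%s.html' % nodeid, 'w') as f:
            f.write(response.body)

You would run it with scrapy runspider from the command line; Scrapy fetches the URLs concurrently out of the box, which also speaks to the parallelism question.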

There are a couple of mistakes in your code.

  1. You got the syntax of for wrong. The for loop needs to be passed an object that it can iterate over, such as a list or a generator.
  2. Adding a number to a string won't work. You need to convert the number to a string first, for example with repr.

With those fixes, your code looks like:

import urllib2

limit = 1441

for i in xrange(1, limit + 1):
    # repr(i) converts the integer node id to a string before concatenation
    url = 'http://pmindia.gov.in/content_print.php?nodeid=' + repr(i) + '&nodetype=2'
    response = urllib2.urlopen(url)
    webContent = response.read()
    # save the page under a filename derived from the node id
    f = open('speech' + repr(i) + '.html', 'w')
    f.write(webContent)
    f.close()  # close() needs parentheses, otherwise the file is never actually closed

Now, if you want to go into web scraping for real, I suggest you have a look at packages such as lxml and requests.
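To sketch that suggestion, and to answer the parallelism question, here is roughly what the same download could look like using requests (a third-party package you would need to install) together with a thread pool from the standard library; the pool size of 8 is an arbitrary choice:

import requests
from multiprocessing.dummy import Pool  # thread-based Pool, fine for I/O-bound work

def fetch(i):
    # download one print-view page and save it under its node id
    url = 'http://pmindia.gov.in/content_print.php?nodeid=%d&nodetype=2' % i
    r = requests.get(url)
    with open('speech%d.html' % i, 'w') as f:
        f.write(r.content)

pool = Pool(8)  # up to 8 downloads in flight at a time
pool.map(fetch, range(1, 1442))
pool.close()
pool.join()

Threads work well here because the time is spent waiting on the network, not on the CPU.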
