[英]Python web crawler : Connection Timed out
我正在尝试实现一个简单的Web爬网程序,并且已经编写了一个简单的代码开始:有两个模块fetcher.py和crawler.py 。 这些是文件:
fetcher.py:
import urllib2
import re
def fetcher(s):
"fetch a web page from a url"
try:
req = urllib2.Request(s)
urlResponse = urllib2.urlopen(req).read()
except urllib2.URLError as e:
print e.reason
return
p,q = s.split("//")
d = q.split("/")
fdes = open(d[0],"w+")
fdes.write(str(urlResponse))
fdes.seek(0)
return fdes
if __name__ == "__main__":
defaultSeed = "http://www.python.org"
print fetcher(defaultSeed)
crawler.py:
from bs4 import BeautifulSoup
import re
from fetchpage import fetcher
usedLinks = open("Used","a+")
newLinks = open("New","w+")
newLinks.seek(0)
def parse(fd,var=0):
soup = BeautifulSoup(fd)
for li in soup.find_all("a",href=re.compile("http")):
newLinks.seek(0,2)
newLinks.write(str(li.get("href")).strip("/"))
newLinks.write("\n")
fd.close()
newLinks.seek(var)
link = newLinks.readline().strip("\n")
return str(link)
def crawler(seed,n):
if n == 0:
usedLinks.close()
newLinks.close()
return
else:
usedLinks.write(seed)
usedLinks.write("\n")
fdes = fetcher(seed)
newSeed = parse(fdes,newLinks.tell())
crawler(newSeed,n-1)
if __name__ == "__main__":
crawler("http://www.python.org/",7)
问题是,当我运行crawler.py时,它对于前4-5个链接都可以正常工作,然后挂起,一分钟后出现以下错误:
[Errno 110] Connection timed out
Traceback (most recent call last):
File "crawler.py", line 37, in <module>
crawler("http://www.python.org/",7)
File "crawler.py", line 34, in crawler
crawler(newSeed,n-1)
File "crawler.py", line 34, in crawler
crawler(newSeed,n-1)
File "crawler.py", line 34, in crawler
crawler(newSeed,n-1)
File "crawler.py", line 34, in crawler
crawler(newSeed,n-1)
File "crawler.py", line 34, in crawler
crawler(newSeed,n-1)
File "crawler.py", line 33, in crawler
newSeed = parse(fdes,newLinks.tell())
File "crawler.py", line 11, in parse
soup = BeautifulSoup(fd)
File "/usr/lib/python2.7/dist-packages/bs4/__init__.py", line 169, in __init__
self.builder.prepare_markup(markup, from_encoding))
File "/usr/lib/python2.7/dist-packages/bs4/builder/_lxml.py", line 68, in prepare_markup
dammit = UnicodeDammit(markup, try_encodings, is_html=True)
File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 191, in __init__
self._detectEncoding(markup, is_html)
File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 362, in _detectEncoding
xml_encoding_match = xml_encoding_re.match(xml_data)
TypeError: expected string or buffer
谁能帮我这个忙,我是python的新手,所以我无法找出为什么连接超时后会超时的信息 ?
连接超时不是特定于python的,它只是意味着您向服务器发出了请求,并且服务器在应用程序愿意等待的时间内没有响应。
很可能发生这种情况的原因是python.org可能具有某种机制来检测何时从脚本获取多个请求,并且可能仅在4-5个请求后完全停止提供页面。 除了在其他站点上尝试脚本之外,您没有真正能做的事情来避免这种情况。
您可以尝试使用代理来避免如上所述被多个请求检测到。 您可能想看看这个答案,以了解如何使用代理发送urllib请求: 如何通过Proxy-urllib打开网站-Python
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.