
Using Python to download a website

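I'm trying to download a website with Python, but I get an error. My intention is to download the site, use Python to extract the relevant information from it, and then save the results to another file on my hard drive. I'm having trouble with step 1. The other steps worked until I started getting a strange SSL error. I'm using Python 2.7.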

import urllib

testsite = urllib.URLopener()
# Raw string so that "\f" in the Windows path is not read as a form-feed escape
testsite.retrieve("https://thepiratebay.se/top/207", r"C:\file.html")

Here is what happens:

Traceback (most recent call last):
  File "C:\Users\Xaero\Desktop\Python\class related\scratch.py", line 10, in <module>
    testsite.retrieve("https://thepiratebay.se/top/207", "C:\file.html")
  File "C:\Python27\lib\urllib.py", line 237, in retrieve
    fp = self.open(url, data)
  File "C:\Python27\lib\urllib.py", line 205, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 435, in open_https
    h.endheaders(data)
  File "C:\Python27\lib\httplib.py", line 940, in endheaders
    self._send_output(message_body)
  File "C:\Python27\lib\httplib.py", line 803, in _send_output
    self.send(msg)
  File "C:\Python27\lib\httplib.py", line 755, in send
    self.connect()
  File "C:\Python27\lib\httplib.py", line 1156, in connect
    self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file)
  File "C:\Python27\lib\ssl.py", line 342, in wrap_socket
    ciphers=ciphers)
  File "C:\Python27\lib\ssl.py", line 121, in __init__
    self.do_handshake()
  File "C:\Python27\lib\ssl.py", line 281, in do_handshake
    self._sslobj.do_handshake()
IOError: [Errno socket error] [Errno 1] _ssl.c:499: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error

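After some research online, it turns out that The Pirate Bay is not very friendly to Python. I found some code that gives the request a different user agent, which got the page to load, but recently that stopped working too. >_<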

This produces the same error:

import urllib2
import os
import datetime
import time
from urllib import FancyURLopener
from random import choice

today = datetime.datetime.today()
today = today.strftime('%Y.%m.%d')

user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12']


class MyOpener(FancyURLopener, object):
    version = choice(user_agents)

myopener = MyOpener()
page = myopener.retrieve('https://thepiratebay.se/top/207', r'C:\TPB.HDMovies' + today + '.html')

Has anyone out there managed to do this successfully?
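The three steps described in the question (download, extract, save) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `download`, `extract`, and `save` are hypothetical, and because the question never says which information counts as relevant, the extraction step simply collects the text of every link on the page. The imports fall back to `urllib2` so the sketch also runs on Python 2.7.

```python
import io
import re

try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2.7

# Hypothetical pattern: the question does not say what "relevant information"
# means, so this sketch captures the contents of every <a> element.
LINK_TEXT = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def download(url, user_agent="Mozilla/5.0"):
    """Step 1: fetch the page and return its body as text."""
    request = Request(url, headers={"User-Agent": user_agent})
    response = urlopen(request)
    try:
        return response.read().decode("utf-8", "replace")
    finally:
        response.close()


def extract(html):
    """Step 2: return the non-empty link texts, with nested tags removed."""
    texts = (re.sub(r"<[^>]+>", "", m).strip() for m in LINK_TEXT.findall(html))
    return [t for t in texts if t]


def save(lines, out_path):
    """Step 3: write one result per line to a UTF-8 text file."""
    with io.open(out_path, "w", encoding="utf-8") as fh:
        for line in lines:
            fh.write(line + u"\n")
```

Chaining them would look like `save(extract(download(url)), out_path)`. A regular expression is enough for a sketch, but a real HTML parser would be more robust against unusual markup.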

Have you tried using Selenium?

pip install selenium

For more installation instructions, see the Selenium documentation.

First, import Selenium:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

Then start the web driver and load the page:

driver = webdriver.Firefox()
driver.get("https://thepiratebay.se/top/207")
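Since the goal in the question is to end up with a file on disk, the loaded page still has to be written out. A minimal sketch of that step, assuming the standard `page_source` attribute of a WebDriver; the helper name `save_rendered_page` and the output file name `top207.html` are hypothetical:

```python
import io


def save_rendered_page(driver, url, out_path):
    """Load url in the given WebDriver and write the rendered HTML to out_path."""
    driver.get(url)
    html = driver.page_source  # HTML as the browser sees it after loading
    with io.open(out_path, "w", encoding="utf-8") as fh:
        fh.write(html)
    return len(html)


def main():
    # Needs the selenium package plus Firefox and its driver to be installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        save_rendered_page(driver, "https://thepiratebay.se/top/207", "top207.html")
    finally:
        driver.quit()  # always close the browser, even if loading fails
```

Calling `main()` opens a real browser window, so this route is slower than a plain HTTP request, but the TLS handshake is handled by the browser rather than by Python's `ssl` module.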

Updating my Python install fixed it. I think I had 2.7.0; after updating to 2.7.11 the problem went away.
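One plausible explanation, which this thread does not confirm, is that the older interpreter could not send the server name (SNI) during the TLS handshake; Python 2.7.9 backported the newer `ssl` module, including SNI support and `SSLContext`. A small sketch for checking what a given interpreter supports (the helper name `tls_report` is hypothetical):

```python
import ssl
import sys


def tls_report():
    """Summarise the TLS-related capabilities of the running interpreter."""
    return {
        "python": "%d.%d.%d" % sys.version_info[:3],
        "openssl": ssl.OPENSSL_VERSION,
        # HAS_SNI does not exist before 2.7.9, hence getattr with a default.
        "sni": getattr(ssl, "HAS_SNI", False),
        # SSLContext reached Python 2 with the 2.7.9 ssl backport.
        "ssl_context": hasattr(ssl, "SSLContext"),
    }


if __name__ == "__main__":
    for name, value in sorted(tls_report().items()):
        print("%s: %s" % (name, value))
```

If `sni` or `ssl_context` comes back false, upgrading the interpreter, as described above, is likely the simplest fix.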

This now retrieves the page perfectly:

import urllib2
import os
import datetime
import time
from urllib import FancyURLopener
from random import choice

today = datetime.datetime.today()
today = today.strftime('%Y.%m.%d')

user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12']


class MyOpener(FancyURLopener, object):
    version = choice(user_agents)

myopener = MyOpener()
page = myopener.retrieve('https://thepiratebay.se/top/207', r'C:\TPB.HDMovies' + today + '.html')

Selenium looks interesting too, though. I'll take a look. Thanks for the help! =D

