Downloading a website with Python

Question

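I am trying to download a website with Python, but I get an error. My plan is to download the page, use Python to extract the relevant information from it, and then save the results to another file on my hard drive. Step 1 is where it goes wrong. The other steps had been working all along, until some strange SSL error showed up. I am using Python 2.7.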

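import urllib

# URLopener is the basic Python 2 opener; retrieve() saves the response to a local file.
testsite = urllib.URLopener()
# Raw string: in a normal literal "\f" would be read as a form-feed character.
testsite.retrieve("https://thepiratebay.se/top/207", r"C:\file.html")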

Here is what happens:

Traceback (most recent call last):
  File "C:\Users\Xaero\Desktop\Python\class related\scratch.py", line 10, in <module>
    testsite.retrieve("https://thepiratebay.se/top/207", "C:\file.html")
  File "C:\Python27\lib\urllib.py", line 237, in retrieve
    fp = self.open(url, data)
  File "C:\Python27\lib\urllib.py", line 205, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 435, in open_https
    h.endheaders(data)
  File "C:\Python27\lib\httplib.py", line 940, in endheaders
    self._send_output(message_body)
  File "C:\Python27\lib\httplib.py", line 803, in _send_output
    self.send(msg)
  File "C:\Python27\lib\httplib.py", line 755, in send
    self.connect()
  File "C:\Python27\lib\httplib.py", line 1156, in connect
    self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file)
  File "C:\Python27\lib\ssl.py", line 342, in wrap_socket
    ciphers=ciphers)
  File "C:\Python27\lib\ssl.py", line 121, in __init__
    self.do_handshake()
  File "C:\Python27\lib\ssl.py", line 281, in do_handshake
    self._sslobj.do_handshake()
IOError: [Errno socket error] [Errno 1] _ssl.c:499: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error

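I did some research online, and it turns out The Pirate Bay does not play well with Python. I found some code that gives the request a different user agent, which got the page to load, but that recently stopped working too. >_<

It produces the same error: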

import urllib2
import os
import datetime
import time
from urllib import FancyURLopener
from random import choice

today = datetime.datetime.today()
today = today.strftime('%Y.%m.%d')

user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12']


class MyOpener(FancyURLopener, object):
    version = choice(user_agents)

myopener = MyOpener()
# Raw string keeps the backslash in the Windows path literal.
page = myopener.retrieve('https://thepiratebay.se/top/207', r'C:\TPB.HDMovies' + today + '.html')

Has anyone out there managed to do this successfully?

Answer 1

Have you tried using Selenium?

pip install selenium

For more installation instructions, see the Selenium documentation.

First, import Selenium:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

Then start the web driver and load the page:

driver = webdriver.Firefox()
driver.get("https://thepiratebay.se/top/207")
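To get from a loaded page to a file on disk, the driver's `page_source` attribute can be written out. The following is only a minimal sketch, assuming Firefox and its driver are installed. The names `save_page_source` and `main`, and the output path, are hypothetical and chosen for illustration:

```python
import io


def save_page_source(driver, url, out_path):
    """Load url in an already-started WebDriver and write its HTML to out_path."""
    driver.get(url)
    html = driver.page_source  # HTML of the page as the browser currently sees it
    # io.open takes an encoding argument on both Python 2.7 and Python 3.
    with io.open(out_path, "w", encoding="utf-8") as fh:
        fh.write(html)
    return len(html)


def main():
    # Imported here so the helper above can be reused without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        # Hypothetical output path; adjust to taste.
        save_page_source(driver, "https://thepiratebay.se/top/207", r"C:\file.html")
    finally:
        driver.quit()  # always close the browser, even if loading fails
```

Calling `main()` opens a browser window, saves the page, and closes the window again. Because the browser does the TLS handshake, this route does not depend on the SSL support of the Python interpreter itself.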

Answer 2

Updating my Python install fixed it. I think I had 2.7.0; after updating to 2.7.11 the problem went away.

The page is now retrieved perfectly:
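A quick way to confirm which interpreter and OpenSSL build are actually in use is to ask the `ssl` module. This is a small diagnostic sketch, and `describe_tls_support` is a hypothetical helper name. One plausible explanation for the fix, not confirmed in this thread, is that Python 2.7.9 added Server Name Indication (SNI) support to the `ssl` module, which some HTTPS servers require during the handshake:

```python
import ssl
import sys


def describe_tls_support():
    """Return the interpreter version and details of the linked OpenSSL build."""
    return {
        "python": "%d.%d.%d" % sys.version_info[:3],
        "openssl": ssl.OPENSSL_VERSION,
        # HAS_SNI is missing on old 2.7 releases, so fall back to False there.
        "sni": getattr(ssl, "HAS_SNI", False),
    }


for key, value in sorted(describe_tls_support().items()):
    print("%s: %s" % (key, value))
```

If `sni` prints as `False`, the interpreter cannot send the server name during the handshake, and upgrading Python is the simplest remedy.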

import urllib2
import os
import datetime
import time
from urllib import FancyURLopener
from random import choice

today = datetime.datetime.today()
today = today.strftime('%Y.%m.%d')

user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12']


class MyOpener(FancyURLopener, object):
    version = choice(user_agents)

myopener = MyOpener()
# Raw string keeps the backslash in the Windows path literal.
page = myopener.retrieve('https://thepiratebay.se/top/207', r'C:\TPB.HDMovies' + today + '.html')
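For readers on Python 3, where `urllib2` no longer exists and `FancyURLopener` is deprecated, roughly the same idea can be sketched with `urllib.request`. This is not the code from this answer. `build_request`, `download`, and the shortened `USER_AGENTS` list are illustrative names, and the 30-second timeout is an assumption:

```python
import random
import urllib.request

# Two of the user-agent strings from the script above, kept short for the example.
USER_AGENTS = [
    "Opera/9.25 (Windows NT 5.1; U; en)",
    "Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)",
]


def build_request(url, user_agents=USER_AGENTS):
    """Create a Request that carries a randomly chosen User-Agent header."""
    headers = {"User-Agent": random.choice(user_agents)}
    return urllib.request.Request(url, headers=headers)


def download(url, out_path, timeout=30):
    """Fetch url and write the raw response bytes to out_path."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        body = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(body)
    return len(body)
```

A call such as `download('https://thepiratebay.se/top/207', r'C:\TPB.HDMovies.html')` would then take the place of `myopener.retrieve(...)`. Writing the bytes unchanged avoids having to guess the page encoding at download time.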

Selenium looks interesting too, though. I'll take a look. Thanks for the help! =D
