
How to download large files in Python 2

I'm trying to download large files (approx. 1 GB) with the mechanize module, but I have been unsuccessful. I've searched for similar threads, but found only ones where the files are publicly accessible and no login is required to obtain them. That is not my case: the file sits in a private section and I need to log in before downloading. Here is what I've done so far.

import mechanize

g_form_id = ""

def is_form_found(form1):
    return "id" in form1.attrs and form1.attrs['id'] == g_form_id

def select_form_with_id_using_br(br1, id1):
    global g_form_id
    g_form_id = id1
    try:
        br1.select_form(predicate=is_form_found)
    except mechanize.FormNotFoundError:
        print "form not found, id: " + g_form_id
        exit()

url_to_login = "https://example.com/"
url_to_file = "https://example.com/download/files/filename=fname.exe"
local_filename = "fname.exe"

br = mechanize.Browser()
br.set_handle_robots(False)   # ignore robots
br.set_handle_refresh(False)  # can sometimes hang without this
br.addheaders = [('User-agent', 'Firefox')]

response = br.open(url_to_login)
# Find login form
select_form_with_id_using_br(br, 'login-form')
# Fill in data
br.form['email'] = 'email@domain.com'
br.form['password'] = 'password'
br.set_all_readonly(False)    # allow everything to be written to
br.submit()

# Try to download file
br.retrieve(url_to_file, local_filename)

But I get an error once 512 MB has been downloaded:

Traceback (most recent call last):
  File "dl.py", line 34, in <module>
    br.retrieve(url_to_file, local_filename)
  File "C:\Python27\lib\site-packages\mechanize\_opener.py", line 277, in retrieve
    block = fp.read(bs)
  File "C:\Python27\lib\site-packages\mechanize\_response.py", line 199, in read
    self.__cache.write(data)
MemoryError: out of memory

Do you have any ideas how to solve this? Thanks.

You can use bs4 and requests to log in, then write the streamed content. There are a few required form fields, including a _token_ field that is definitely necessary:

from bs4 import BeautifulSoup
import requests
from urlparse import urljoin  # Python 2; use urllib.parse.urljoin in Python 3

data = {'email': 'email@domain.com', 'password': 'password'}
base = "https://support.codasip.com"

with requests.Session() as s:
    # update headers
    s.headers.update({'User-agent': 'Firefox'})

    # use bs4 to parse the form fields
    soup = BeautifulSoup(s.get(base).content, "html.parser")
    form = soup.select_one("#frm-loginForm")
    # the action happens to be a relative path here; that is not always the case
    action = form["action"]

    # Get rest of the fields, ignore password and email.
    for inp in form.find_all("input", {"name":True,"value":True}):
        name, value = inp["name"], inp["value"]
        if name not in data:
            data[name] = value
    # login
    s.post(urljoin(base, action), data=data)
    # stream the protected url so the whole file is never held in memory
    # (url_to_file and local_filename as defined in the question)
    with open(local_filename, "wb") as f:
        for chk in s.get(url_to_file, stream=True).iter_content(1024):
            f.write(chk)

Try downloading/writing it in chunks. It seems the whole file is being held in memory.
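A minimal sketch of that idea: the loop below copies fixed-size blocks from any file-like object, so memory use stays bounded by the chunk size (`io.BytesIO` stands in here for a real response such as the one returned by `br.open(url_to_file)`). Note that the traceback shows mechanize's response wrapper caching everything it reads (`self.__cache.write(data)`), so a chunked loop may only help once that caching wrapper is avoided, e.g. by streaming with requests instead.

```python
import io

def copy_in_chunks(fp, out, chunk_size=64 * 1024):
    # Read fixed-size blocks so only one chunk is held in memory at a
    # time, instead of buffering the whole ~1 GB response as retrieve() does.
    while True:
        block = fp.read(chunk_size)
        if not block:
            break
        out.write(block)

# io.BytesIO stands in for a real HTTP response object
src = io.BytesIO(b"x" * 200000)
dst = io.BytesIO()
copy_in_chunks(src, dst)
```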

You should specify a Range header for your request if the server supports it.

https://en.wikipedia.org/wiki/List_of_HTTP_header_fields
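A hedged sketch of resuming a partial download with a Range request, assuming the server answers 206 Partial Content; `make_range_header` is a hypothetical helper, not part of any library:

```python
import os

def make_range_header(path):
    # Resume from the size of an existing partial file, if any;
    # 'bytes=N-' asks the server for everything from offset N onward.
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    return {'Range': 'bytes=%d-' % offset} if offset else {}
```

With requests this could look like `resp = s.get(url_to_file, headers=make_range_header(local_filename), stream=True)`, appending to the file (mode `'ab'`) only when `resp.status_code == 206`.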
