
How to download large files in Python 2

I'm trying to download large files (approx. 1 GB) with the mechanize module, but I have been unsuccessful. I've searched for similar threads, but found only ones where the files are publicly accessible and no login is required to obtain them. That is not my case: the file sits in a private section and I need to log in before downloading. Here is what I've done so far.

import mechanize

g_form_id = ""

def is_form_found(form1):
    return "id" in form1.attrs and form1.attrs['id'] == g_form_id

def select_form_with_id_using_br(br1, id1):
    global g_form_id
    g_form_id = id1
    try:
        br1.select_form(predicate=is_form_found)
    except mechanize.FormNotFoundError:
        print "form not found, id: " + g_form_id
        exit()

url_to_login = "https://example.com/"
url_to_file = "https://example.com/download/files/filename=fname.exe"
local_filename = "fname.exe"

br = mechanize.Browser()
br.set_handle_robots(False)   # ignore robots
br.set_handle_refresh(False)  # can sometimes hang without this
br.addheaders = [('User-agent', 'Firefox')]

response = br.open(url_to_login)
# Find login form
select_form_with_id_using_br(br, 'login-form')
# Fill in data
br.form['email'] = 'email@domain.com'
br.form['password'] = 'password'
br.set_all_readonly(False)    # allow everything to be written to
br.submit()

# Try to download file
br.retrieve(url_to_file, local_filename)

But I get an error once 512 MB has been downloaded:

Traceback (most recent call last):
  File "dl.py", line 34, in <module>
    br.retrieve(url_to_file, local_filename)
  File "C:\Python27\lib\site-packages\mechanize\_opener.py", line 277, in retrieve
    block = fp.read(bs)
  File "C:\Python27\lib\site-packages\mechanize\_response.py", line 199, in read
    self.__cache.write(data)
MemoryError: out of memory

Do you have any ideas how to solve this? Thanks.

You can use bs4 and requests to log in, then write the streamed content. There are a few required form fields, including a _token_ field that is definitely necessary:

from bs4 import BeautifulSoup
import requests
from urlparse import urljoin  # Python 2; use urllib.parse.urljoin in Python 3

data = {'email': 'email@domain.com', 'password': 'password'}
base = "https://support.codasip.com"

with requests.Session() as s:
    # update headers
    s.headers.update({'User-agent': 'Firefox'})

    # use bs4 to parse the form fields
    soup = BeautifulSoup(s.get(base).content, "html.parser")
    form = soup.select_one("#frm-loginForm")
    # the action happens to be a relative path here; that is not always the case
    action = form["action"]

    # Get rest of the fields, ignore password and email.
    for inp in form.find_all("input", {"name":True,"value":True}):
        name, value = inp["name"], inp["value"]
        if name not in data:
            data[name] = value
    # login
    s.post(urljoin(base, action), data=data)
    # stream the protected url so the whole file is never held in memory
    # (url_to_file and local_filename as defined in the question)
    with open(local_filename, "wb") as f:
        for chk in s.get(url_to_file, stream=True).iter_content(1024):
            f.write(chk)

Try downloading/writing it in chunks. It seems the whole file is being held in memory.
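A minimal sketch of that idea: the loop below copies fixed-size blocks from any file-like object, so memory use stays bounded by the chunk size (`io.BytesIO` stands in here for a real response such as the one returned by `br.open(url_to_file)`). Note that the traceback shows mechanize's response wrapper caching everything it reads (`self.__cache.write(data)`), so a chunked loop may only help once that caching wrapper is avoided, e.g. by streaming with requests instead.

```python
import io

def copy_in_chunks(fp, out, chunk_size=64 * 1024):
    # Read fixed-size blocks so only one chunk is held in memory at a
    # time, instead of buffering the whole ~1 GB response as retrieve() does.
    while True:
        block = fp.read(chunk_size)
        if not block:
            break
        out.write(block)

# io.BytesIO stands in for a real HTTP response object
src = io.BytesIO(b"x" * 200000)
dst = io.BytesIO()
copy_in_chunks(src, dst)
```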

You should specify a Range header for your request if the server supports it.

https://en.wikipedia.org/wiki/List_of_HTTP_header_fields
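A hedged sketch of resuming a partial download with a Range request, assuming the server answers 206 Partial Content; `make_range_header` is a hypothetical helper, not part of any library:

```python
import os

def make_range_header(path):
    # Resume from the size of an existing partial file, if any;
    # 'bytes=N-' asks the server for everything from offset N onward.
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    return {'Range': 'bytes=%d-' % offset} if offset else {}
```

With requests this could look like `resp = s.get(url_to_file, headers=make_range_header(local_filename), stream=True)`, appending to the file (mode `'ab'`) only when `resp.status_code == 206`.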
