
Python | HTTP - How to check file size before downloading it

I am crawling the web using urllib3. Example code:

from urllib3 import PoolManager

pool = PoolManager()
response = pool.request("GET", url)

The problem is that I may stumble upon a URL that points to a really large file, and I am not interested in downloading it.

I found this question - Link - and it suggests using urllib and urlopen. I don't want to contact the server twice.

I want to limit the file size to 25MB. Is there a way I can do this with urllib3?

If the server supplies a Content-Length header, then you can use that to determine if you'd like to continue downloading the remainder of the body or not. If the server does not provide the header, then you'll need to stream the response until you decide you no longer want to continue.

To do this, you'll need to make sure that you're not preloading the full response, which means passing preload_content=False to the request.

from urllib3 import PoolManager

pool = PoolManager()
response = pool.request("GET", url, preload_content=False)

# Maximum amount we want to read (1 MB here; use 25 * 1024 * 1024 for a 25 MB cap)
max_bytes = 1000000

content_bytes = response.headers.get("Content-Length")
if content_bytes and int(content_bytes) < max_bytes:
    # Expected body is smaller than our maximum, read the whole thing
    data = response.read()
    # Do something with data
    ...
elif content_bytes is None:
    # Alternatively, stream until we hit our limit
    amount_read = 0
    for chunk in response.stream():
        amount_read += len(chunk)
        # Save chunk
        ...
        if amount_read > max_bytes:
            break

# Release the connection back into the pool
response.release_conn()
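
The two branches above can also be folded into one helper that enforces the limit whether or not Content-Length is present, since a server can omit the header or report a wrong value. The following is only a minimal sketch: the name read_limited, the MAX_BYTES constant and the 64 KB chunk size are assumptions for illustration, not part of urllib3. It relies only on the response exposing .headers, .stream() and .release_conn(), as a urllib3 response opened with preload_content=False does.

```python
from typing import Optional

# 25 MB cap taken from the question; adjust as needed (assumed constant name)
MAX_BYTES = 25 * 1024 * 1024


def read_limited(response, max_bytes: int = MAX_BYTES,
                 chunk_size: int = 65536) -> Optional[bytes]:
    """Return the body if it fits within max_bytes, otherwise None.

    `response` should behave like a urllib3 response requested with
    preload_content=False (hypothetical helper, not a urllib3 API).
    """
    try:
        declared = response.headers.get("Content-Length")
        # Fast path: the server announced a body larger than the limit
        if declared is not None and declared.isdigit() and int(declared) > max_bytes:
            return None

        # Stream anyway, because the header may be missing or inaccurate
        chunks = []
        amount_read = 0
        for chunk in response.stream(chunk_size):
            amount_read += len(chunk)
            if amount_read > max_bytes:
                return None  # stop reading as soon as the cap is exceeded
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        # Always hand the connection back to the pool
        response.release_conn()
```

With the pool from the answer, this would be called as body = read_limited(pool.request("GET", url, preload_content=False)), and a None result means the file was skipped. Only one request is made, so the server is not contacted twice.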

