
How to Download All Zip Files From a Website Using Python

I am trying to download all of the zipped files from this webpage: https://www.google.com/googlebooks/uspto-patents-grants-text.html

Full disclosure, I am not a professional coder, so if I have made some dumb mistakes, please forgive me.

This is the code I have:

from bs4 import BeautifulSoup            
import requests

url = "https://www.google.com/googlebooks/uspto-patents-grants-text.html"
html = requests.get(url)
soup = BeautifulSoup(html.text, "html.parser")

for link in soup.find_all('a', href=True):
    href = link['href']

    if any(href.endswith(x) for x in ['.zip']):
    #if any(href.endswith('.zip')):
        print("Downloading '{}'".format(href))
        remote_file = requests.get(url + href)

        with open(href, 'wb') as f:
            for chunk in remote_file.iter_content(chunk_size=1024): 
                if chunk: 
                    f.write(chunk)  

The error I am getting when I run the code is:

File "C:/Users/#USER#/#FILEPATH#/Python/patentzipscraper2.py", line 16, in
    with open(href, 'wb') as f:
OSError: [Errno 22] Invalid argument: 'http://storage.googleapis.com/patents/grant_full_text/2015/ipg150106.zip'

However, when I type that address into a browser, I can download the zipped file. I am guessing this has something to do with the format of the zipped files, and that I can't necessarily download/open them directly, but I am not sure. The code I based this on was downloading files that can clearly be downloaded directly (like .txt).

Any help on how to download these zips would be appreciated.

Implement something like this in your code:

import urllib.request

urllib.request.urlretrieve("http://yoursite.com/file.zip", "file.zip")

Note that the second argument to urlretrieve is a local file name, not a URL. That is also why your open(href, 'wb') call fails: href is the full URL, and characters like ':' and '/' are not valid in a Windows file name, hence the OSError.
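If you want to stay with requests, the question's script only needs two fixes: the links on that page are already absolute URLs (so don't prepend the page URL), and the local file name should be just the last path segment, not the whole URL. Here is a sketch with those fixes applied; the helper name filename_from_url is my own invention, not part of any library:

```python
import os
from urllib.parse import urlparse, urljoin

import requests
from bs4 import BeautifulSoup

def filename_from_url(href):
    """Return only the file-name part of a URL, safe to pass to open()."""
    return os.path.basename(urlparse(href).path)

def download_zips(page_url):
    html = requests.get(page_url)
    html.raise_for_status()
    soup = BeautifulSoup(html.text, "html.parser")

    for link in soup.find_all('a', href=True):
        # urljoin handles both absolute and relative hrefs correctly
        href = urljoin(page_url, link['href'])
        if not href.endswith('.zip'):
            continue

        filename = filename_from_url(href)  # e.g. 'ipg150106.zip'
        print("Downloading '{}'".format(href))
        with requests.get(href, stream=True) as remote_file:
            remote_file.raise_for_status()
            with open(filename, 'wb') as f:
                for chunk in remote_file.iter_content(chunk_size=8192):
                    f.write(chunk)
```

stream=True keeps requests from loading each multi-hundred-megabyte archive into memory at once, and raise_for_status() surfaces HTTP errors instead of silently writing an error page to disk.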

