
How to Download All Zip Files From a Website Using Python

I am trying to download all of the zipped files from this webpage: https://www.google.com/googlebooks/uspto-patents-grants-text.html

Full disclosure, I am not a professional coder, so if I have made some dumb mistakes, please forgive me.

This is the code I have:

from bs4 import BeautifulSoup            
import requests

url = "https://www.google.com/googlebooks/uspto-patents-grants-text.html"
html = requests.get(url)
soup = BeautifulSoup(html.text, "html.parser")

for link in soup.find_all('a', href=True):
    href = link['href']

    if any(href.endswith(x) for x in ['.zip']):
    #if any(href.endswith('.zip')):
        print("Downloading '{}'".format(href))
        remote_file = requests.get(url + href)

        with open(href, 'wb') as f:
            for chunk in remote_file.iter_content(chunk_size=1024): 
                if chunk: 
                    f.write(chunk)  

The error I am getting when I run the code is:

File "C:/Users/#USER#/#FILEPATH#/Python/patentzipscraper2.py", line 16, in
    with open(href, 'wb') as f:
OSError: [Errno 22] Invalid argument: 'http://storage.googleapis.com/patents/grant_full_text/2015/ipg150106.zip'

However, when I type that address into a browser, I can download the zipped file. I am guessing this has something to do with the format of the zipped files, and that I can't necessarily download/open them directly, but I am not sure. The code I based this on was downloading files that can clearly be downloaded directly (like .txt).

Any help on how to download these zips would be appreciated.

Implement something like this in your code:

import urllib.request

urllib.request.urlretrieve("http://yoursite.com/file.zip", "file.zip")

Note that the second argument to urlretrieve is a local file name, not a URL. That is also why your open(href, 'wb') call fails: href is the full URL, and characters like ':' and '/' are not valid in a Windows file name, hence the OSError.
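If you want to stay with requests, the question's script only needs two fixes: the links on that page are already absolute URLs (so don't prepend the page URL), and the local file name should be just the last path segment, not the whole URL. Here is a sketch with those fixes applied; the helper name filename_from_url is my own invention, not part of any library:

```python
import os
from urllib.parse import urlparse, urljoin

import requests
from bs4 import BeautifulSoup

def filename_from_url(href):
    """Return only the file-name part of a URL, safe to pass to open()."""
    return os.path.basename(urlparse(href).path)

def download_zips(page_url):
    html = requests.get(page_url)
    html.raise_for_status()
    soup = BeautifulSoup(html.text, "html.parser")

    for link in soup.find_all('a', href=True):
        # urljoin handles both absolute and relative hrefs correctly
        href = urljoin(page_url, link['href'])
        if not href.endswith('.zip'):
            continue

        filename = filename_from_url(href)  # e.g. 'ipg150106.zip'
        print("Downloading '{}'".format(href))
        with requests.get(href, stream=True) as remote_file:
            remote_file.raise_for_status()
            with open(filename, 'wb') as f:
                for chunk in remote_file.iter_content(chunk_size=8192):
                    f.write(chunk)
```

stream=True keeps requests from loading each multi-hundred-megabyte archive into memory at once, and raise_for_status() surfaces HTTP errors instead of silently writing an error page to disk.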

