Downloading all zip files from a URL

I need to download all the zip files from the following URL: https://www.ercot.com/mp/data-products/data-product-details?id=NP7-802-M

The zip files are listed on that page (the original post showed them in a screenshot).

I am trying the following code:

import urllib.request
import zipfile
url = fr'https://www.ercot.com/mp/data-products/data-product-details?id=NP7-802-M'
response = urllib.request.urlopen(url)
html = response.read()

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# Find all the links to zip files (it finds nothing!!!!)
zip_links = soup.find_all("a", href=lambda x: x and x.endswith(".zip"))

# Extract the URLs of the zip files
zip_urls = [link["href"] for link in zip_links]

def download_zip(url):
    # Download the zip file
    response = urllib.request.urlopen(url)
    # Save the zip file to a local file
    zip_data = response.read()
    with open("zip_file.zip", "wb") as zip_file:
        zip_file.write(zip_data)

for url in zip_urls:
    download_zip(url)

I have tried several variations of the above, but none has worked so far. I am not sure how to proceed.

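The zip links are not in the page's static HTML, which is why BeautifulSoup finds nothing. Everything you need comes from a single JSON endpoint: query it for the document list, then download each zip file.

Here's how: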

import os
import time
from pathlib import Path
from shutil import copyfileobj

import requests

endpoint = f"https://www.ercot.com/misapp/servlets/" \
           f"IceDocListJsonWS?reportTypeId=11203&_={int(time.time())}"

headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Encoding": "gzip, deflate",
    "Host": "www.ercot.com",
    "Referer": "https://www.ercot.com/mp/data-products/data-product-details?id=NP7-802-M",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:108.0) Gecko/20100101 Firefox/108.0",
    "X-Requested-With": "XMLHttpRequest"
}

os.makedirs("zip_files", exist_ok=True)
download_url = "https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId="

with requests.Session() as s:
    auction_results = s.get(endpoint, headers=headers).json()
    for result in auction_results["ListDocsByRptTypeRes"]["DocumentList"]:
        file_name = result["Document"]["ConstructedName"]
        # Use the per-document ID; ReportTypeID is the same (11203) for every entry.
        zip_url = f"{download_url}{result['Document']['DocID']}"
        print(f"Downloading {result['Document']['FriendlyName']}...")
        r = s.get(zip_url, headers=headers, stream=True)
        with open(Path("zip_files") / file_name, 'wb') as f:
            copyfileobj(r.raw, f)

Download output:

Downloading 20232nd6AnnualAuctionSeq2CRRAuctionResults...
Downloading 20231st6AnnualAuctionSeq1CRRAuctionResults...
Downloading 20251st6AnnualAuctionSeq6CRRAuctionResults...
...
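
If you want to see which URLs would be requested before downloading anything, the URL-building step can be pulled out into a small function and run against a parsed response. This is only a sketch: build_download_jobs and the sample payload are hypothetical, and the "DocID" key is an assumption about the name of the per-document identifier in the listing (pass a different id_key if your response uses another name).

```python
from typing import Dict, List, Tuple

DOWNLOAD_URL = "https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId="


def build_download_jobs(payload: Dict, id_key: str = "DocID") -> List[Tuple[str, str]]:
    """Return (file_name, url) pairs for every document in the listing.

    id_key is the name of the per-document identifier field (assumed "DocID").
    """
    jobs = []
    for entry in payload["ListDocsByRptTypeRes"]["DocumentList"]:
        doc = entry["Document"]
        jobs.append((doc["ConstructedName"], f"{DOWNLOAD_URL}{doc[id_key]}"))
    return jobs


# Hypothetical sample shaped like the listing response; the values are made up.
sample = {
    "ListDocsByRptTypeRes": {
        "DocumentList": [
            {"Document": {"ConstructedName": "a.zip", "DocID": "111"}},
            {"Document": {"ConstructedName": "b.zip", "DocID": "222"}},
        ]
    }
}

for name, url in build_download_jobs(sample):
    print(name, url)
```

Every pair should have a different URL. If they all come out identical, the identifier field you picked is shared across documents and each download would fetch the same file.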

The download script should give you a folder named zip_files containing:

zip_files/
├── rpt.00011203.0000000000000000.20200103.100105725.20211st6AnnualAuctionSeq3CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20200206.100132142.20212nd6AnnualAuctionSeq4CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20200305.100126722.20221st6AnnualAuctionSeq5CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20200402.100100899.20222nd6AnnualAuctionSeq6CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20200430.100110058.20202nd6AnnualAuctionSeq1CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20200604.100110480.20211st6AnnualAuctionSeq2CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20200702.100115453.20212nd6AnnualAuctionSeq3CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20200806.100118989.20221st6AnnualAuctionSeq4CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20200903.100055873.20222nd6AnnualAuctionSeq5CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20201001.100059887.20231st6AnnualAuctionSeq6CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20201105.100138522.20211st6AnnualAuctionSeq1CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20201203.100105088.20212nd6AnnualAuctionSeq2CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20210105.100208000.20221st6AnnualAuctionSeq3CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20210204.100109061.20222nd6AnnualAuctionSeq4CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20210304.100107618.20231st6AnnualAuctionSeq5CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20210401.100114752.20232nd6AnnualAuctionSeq6CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20210506.100121795.20212nd6AnnualAuctionSeq1CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20210603.100129464.20221st6AnnualAuctionSeq2CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20210701.100127848.20222nd6AnnualAuctionSeq3CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20210805.100115654.20231st6AnnualAuctionSeq4CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20210902.100111410.20232nd6AnnualAuctionSeq5CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20210930.100118983.20241st6AnnualAuctionSeq6CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20211104.100126210.20221st6AnnualAuctionSeq1CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20211202.100129217.20222nd6AnnualAuctionSeq2CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20220104.100123693.20231st6AnnualAuctionSeq3CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20220203.100118822.20232nd6AnnualAuctionSeq4CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20220303.100110243.20241st6AnnualAuctionSeq5CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20220331.100104447.20242nd6AnnualAuctionSeq6CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20220505.100109570.20222nd6AnnualAuctionSeq1CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20220602.100104646.20231st6AnnualAuctionSeq2CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20220630.100107920.20232nd6AnnualAuctionSeq3CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20220804.100105512.20241st6AnnualAuctionSeq4CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20220901.100103704.20242nd6AnnualAuctionSeq5CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20221006.100123996.20251st6AnnualAuctionSeq6CRRAuctionResults.zip
├── rpt.00011203.0000000000000000.20221103.100114498.20231st6AnnualAuctionSeq1CRRAuctionResults.zip
└── rpt.00011203.0000000000000000.20221201.130125954.20232nd6AnnualAuctionSeq2CRRAuctionResults.zip
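
Since a failed request can leave an HTML error page saved under a .zip name, it may be worth checking the folder once the downloads finish. The following is a minimal sketch using only the standard library; check_zip_files is a hypothetical helper name, and it assumes the files sit in the zip_files folder created above.

```python
import zipfile
from pathlib import Path
from typing import Dict


def check_zip_files(folder: str) -> Dict[str, bool]:
    """Map each *.zip file name in folder to True if it opens and passes a CRC check."""
    results = {}
    for path in sorted(Path(folder).glob("*.zip")):
        ok = False
        if zipfile.is_zipfile(path):
            try:
                with zipfile.ZipFile(path) as archive:
                    # testzip() returns the name of the first corrupt member, or None.
                    ok = archive.testzip() is None
            except zipfile.BadZipFile:
                ok = False
        results[path.name] = ok
    return results


if __name__ == "__main__":
    for name, ok in check_zip_files("zip_files").items():
        print(f"{'OK ' if ok else 'BAD'} {name}")
```

Any file reported as BAD is a candidate for re-downloading.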
