
How can we download multiple Excel files from a URL?

I am trying to download all the Excel files from the URL below.

https://healthcare.ascension.org/price-transparency/price-transparency-files

Here is my hacky code.

from bs4 import BeautifulSoup
import urllib.request


for numb in ('1', '10'):
    resp = urllib.request.urlopen("https://healthcare.ascension.org/price-transparency/price-transparency-files")
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))

    for link in soup.find_all('a', href=True):
        if 'xls' in link['href']:
            print(link['href'])
            dls = link['href']
            urllib.request.urlretrieve(dls, str(numb) + ".xls")

I just threw that together, based on some Google searches I just did. When I run that code, I get this error.

ValueError: unknown url type: '/-/media/project/ascension/healthcare/price-transparency-files/al/630578923_ascension-saint-vincents-east_standardcharges.xlsx'

I just looped 1:10, because I'm not sure how to get the actual names of the Excel files, but a look behind the page shows that the Excel URLs look like this.

[screenshot: Excel file URLs visible in the page source]

Each Excel file has a sheet named 'Standard Charges'. I'm not sure if I have to download each file, or just open it and copy the data from the sheet named 'Standard Charges', but basically I'm trying to get everything from 'Standard Charges' into one single data frame. When I look at a few of the sheets, I can quickly tell that the headers are sometimes slightly different, but I think 'pd.concat' should be able to handle that pretty seamlessly. Any idea how I can get everything into one data frame? Thanks.
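The combining step described above can be sketched like this. This assumes the workbooks have already been downloaded into a `tmp/` folder; the `source_file` column is an illustrative addition, not something in the original files:

```python
# Sketch: stack the 'Standard Charges' sheet from every downloaded
# workbook into one DataFrame. Assumes the files are already in tmp/.
import glob

import pandas as pd

frames = []
for path in glob.glob("tmp/*.xls*"):
    try:
        # read only the 'Standard Charges' sheet from each workbook
        df = pd.read_excel(path, sheet_name="Standard Charges")
    except ValueError:  # raised when a workbook has no such sheet
        continue
    df["source_file"] = path  # remember which file each row came from
    frames.append(df)

# pd.concat aligns on column names and fills NaN where headers differ
combined = pd.concat(frames, ignore_index=True, sort=False) if frames else pd.DataFrame()
```

Because `pd.concat` aligns on column names, rows from sheets with slightly different headers end up with `NaN` in the columns they lack, which can be cleaned up afterwards.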

Use a User-Agent header to get a successful request; you can find one in the browser's Network tab.

import requests
from bs4 import BeautifulSoup

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"}
resp = requests.get("https://healthcare.ascension.org/price-transparency/price-transparency-files", headers=headers)
soup = BeautifulSoup(resp.text, "html.parser")

Code for downloading each Excel file:

import os
import urllib.request

os.makedirs("tmp", exist_ok=True)  # the target folder must exist

for link in soup.find_all('a', href=True):
    if 'xls' in link['href']:
        url = "https://healthcare.ascension.org" + link['href']
        file = "tmp/" + url.split("/")[-1]  # keep the original file name and extension
        print(file)
        urllib.request.urlretrieve(url, file)

Here the href attribute holds only the path portion of the URL, so you have to concatenate it with the base URL before you can download the Excel file locally. That is also why the original code raised `ValueError: unknown url type`: `urlretrieve` was given a site-relative path instead of a full URL.
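Alternatively, `urllib.parse.urljoin` from the standard library resolves a relative href against the base URL, and also handles the case where an href is already absolute. A small sketch, using the path from the error message in the question:

```python
from urllib.parse import urljoin

base = "https://healthcare.ascension.org/price-transparency/price-transparency-files"
# a site-relative href like the one in the ValueError in the question
href = "/-/media/project/ascension/healthcare/price-transparency-files/al/630578923_ascension-saint-vincents-east_standardcharges.xlsx"

# urljoin resolves the href against the base URL's scheme and host
print(urljoin(base, href))
```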

Output: [screenshot of the downloaded files]
