
Download file from Blob URL with Python

I want my Python script to download the Master data (Download, XLSX) Excel file from this Frankfurt Stock Exchange webpage.

When I retrieve it with urllib or wget, it turns out that the URL leads to a Blob, and the downloaded file is only 289 bytes and unreadable.

http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx

I'm entirely unfamiliar with Blobs and have these questions:

  • Can the file "behind the Blob" be successfully retrieved using Python?

  • If so, is it necessary to uncover the "true" URL behind the Blob, if there is such a thing, and how? My concern is that the link above isn't static and may change often.

That 289-byte file is probably the HTML of a 403 Forbidden page. This happens because the server rejects requests that don't specify a user agent.
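One quick way to confirm this diagnosis: a real .xlsx file is a ZIP archive, so its first bytes are the ZIP magic number, while a small HTML error page's are not. A minimal offline sketch (the sample byte strings below are made up for illustration):

```python
# An .xlsx file is a ZIP archive, so a genuine download starts with the
# ZIP magic bytes; a 289-byte HTML error page will not.
XLSX_MAGIC = b'PK\x03\x04'

def looks_like_xlsx(payload: bytes) -> bool:
    """Return True if the payload plausibly is a real .xlsx file."""
    return payload[:4] == XLSX_MAGIC

print(looks_like_xlsx(b'PK\x03\x04rest-of-zip-data'))  # True
print(looks_like_xlsx(b'<html><body>403 Forbidden'))   # False
```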

Python 3

# python3
import urllib.request as request

url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
# fake Safari user agent so the server does not reject the request
fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'
r = request.Request(url, headers={'User-Agent': fake_useragent})
f = request.urlopen(r)

# print the raw bytes, or write them to a file instead
print(f.read())
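Before actually sending the request, you can verify that the header really gets attached: `Request.header_items()` lists exactly what would be sent, without any network traffic. A sketch of that check, plus a small helper for saving the bytes to disk (the helper name `download` is my own, not part of the original answer):

```python
import urllib.request as request

url = ('http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e'
       '/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx')
fake_useragent = ('Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) '
                  'AppleWebKit/536.26 (KHTML, like Gecko) '
                  'Version/6.0 Mobile/10A5355d Safari/8536.25')

req = request.Request(url, headers={'User-Agent': fake_useragent})

# Inspect the headers that will go on the wire -- no network access needed.
# urllib normalizes header-key capitalization, so compare case-insensitively.
sent = {k.lower(): v for k, v in req.header_items()}
print(sent['user-agent'] == fake_useragent)  # True

def download(req, path):
    """Stream the response to a file instead of printing raw bytes."""
    with request.urlopen(req) as resp, open(path, 'wb') as out:
        out.write(resp.read())

# download(req, 'All-tradable-ETFs.xlsx')  # uncomment to fetch for real
```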

Python 2

# python2
import urllib2

url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
# fake Safari user agent
fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'

r = urllib2.Request(url, headers={'User-Agent': fake_useragent})
f = urllib2.urlopen(r)

print(f.read())

from bs4 import BeautifulSoup
import requests
import re

url = 'http://www.xetra.com/xetra-en/instruments/etf-exchange-traded-funds/list-of-tradable-etfs'
html = requests.get(url)
page = BeautifulSoup(html.content, 'html.parser')
reg = re.compile('Master data')
find = page.find('span', text=reg)  # locate the "Master data" link
file_url = 'http://www.xetra.com' + find.parent['href']
file = requests.get(file_url)
with open(r'C:\Users\user\Downloads\file.xlsx', 'wb') as ff:
    ff.write(file.content)

I recommend requests and BeautifulSoup; both are good libraries.
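This also addresses the concern about the link changing: the page is re-scraped on every run, so whatever Blob URL is current gets picked up. The same idea in a stdlib-only, offline form (the HTML snippet and the hash in it are made-up stand-ins for the real page's markup):

```python
import re

# A made-up fragment standing in for the real Xetra page markup.
html = ('<a href="/blob/1193366/abc123/All-tradable-ETFs-ETCs-and-ETNs.xlsx">'
        '<span>Master data</span></a>')

# Find the href of the link whose <span> text is "Master data".
match = re.search(r'href="([^"]+\.xlsx)"[^>]*>\s*<span>Master data', html)
file_url = 'http://www.xetra.com' + match.group(1)
print(file_url)
```

For real-world markup, a proper parser like BeautifulSoup (as in the answer above) is more robust than a regex.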

For me, the target download URL looked like: blob: https://jdc.xxxx.com/2c21275b-c1ef-487a-a378-1672cd2d0760

I tried writing the original response to a local xlsx file and found that it worked.

import requests

# `data` is the JSON payload the export endpoint expects (not shown here)
r = requests.post('http://jdc.xxxx.com/rawreq/api/handler/desktop/export-rat-view?flag=jdc',
                  json=data,
                  headers={'content-type': 'application/json'})
file_name = 'file_name.xlsx'
with open(file_name, 'wb') as f:
    for chunk in r.iter_content(100000):
        f.write(chunk)
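Before writing the response to disk, it's worth checking that the server actually returned a spreadsheet rather than an error page. A hedged sketch of such a check; the function and the accepted content types are my own choices, not something the original answer specifies:

```python
def should_save(status_code: int, content_type: str) -> bool:
    """Accept only successful responses whose type plausibly is an xlsx payload."""
    ok_types = (
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
        'application/octet-stream',
    )
    # Content-Type may carry parameters like "; charset=utf-8" -- strip them.
    return status_code == 200 and content_type.split(';')[0].strip() in ok_types

print(should_save(200, 'application/octet-stream'))  # True
print(should_save(403, 'text/html; charset=utf-8'))  # False
```

With requests, the two arguments would come from `r.status_code` and `r.headers.get('Content-Type', '')`.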
