Python - downloading application/csv data from webpage
I'm using the requests library to fetch a particular webpage that contains a link to download data as CSV. The link has the following format:
<a class="csv-download" download="data.csv" target"_blank"="" style="cursor:pointer" href="data:application/csv;charset=utf-8,%22Date%22%2C%22Volume%2FLength%22%2C%22Length%2FWidth%22%2C%22Weight%20gm%22%0A%2208-Jan-2018%22%2C%22%20%20%20%20%20%20%2023.19%22%2C%22%20%20%20%20%20%20%20%202.13%22%2C%22%20%20%20%20%20%20%20%201.32%22%0A" target="_blank">Download csv</a>
When this link is clicked in a browser, the data is downloaded to a file named download.csv.
I need to extract this data as CSV and store it in a file. I'm using BeautifulSoup in the project for parsing HTML files.
How do I go about downloading the CSV file from Python? Here is what I have so far:
import requests
from bs4 import BeautifulSoup as BS
r = requests.get(url)
soup = BS(r.text)
target_elt = soup.find('a', "csv-download")
# TODO - download the csv data
Since the contents of the file you need are stored in the href attribute of target_elt, starting after the first comma, you can split that attribute on the first comma and then URL-decode the portion after it:
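from urllib.parse import unquote  # Python 3; in Python 2 this was urllib.unquote

import requests
from bs4 import BeautifulSoup as BS

r = requests.get(url)  # url is the address of the page containing the link
soup = BS(r.text, "html.parser")
target_elt = soup.find('a', "csv-download")
# The href is a data: URI; everything after the first comma is the percent-encoded CSV.
header, encoded = target_elt.attrs["href"].split(",", 1)
data = unquote(encoded)
# newline="" keeps the decoded line endings unchanged when writing.
with open("data.csv", "w", newline="", encoding="utf-8") as fp:
    fp.write(data)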
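The split-and-decode step can also be tried without any network access. The following is a minimal sketch using only the standard library; decode_data_uri is a hypothetical helper name, the sample href is a shortened stand-in for the one in the question, and the base64 branch is an assumption covering data: links that are base64-encoded rather than percent-encoded (the link in the question is not).

```python
import base64
import csv
import io
from urllib.parse import unquote_to_bytes


def decode_data_uri(href, default_charset="utf-8"):
    """Return the text payload of a data: URI (hypothetical helper)."""
    if not href.startswith("data:"):
        raise ValueError("expected a data: URI")
    # Everything before the first comma is the media type plus parameters.
    header, sep, payload = href[len("data:"):].partition(",")
    if not sep:
        raise ValueError("data: URI has no comma before the payload")
    params = header.split(";")[1:]
    # Undo the percent-encoding (%22 -> ", %2C -> comma, %0A -> newline).
    raw = unquote_to_bytes(payload)
    if "base64" in params:
        raw = base64.b64decode(raw)
    # Use the charset parameter if present, otherwise fall back to the default.
    charset = default_charset
    for param in params:
        if param.lower().startswith("charset="):
            charset = param.split("=", 1)[1]
    return raw.decode(charset)


# Shortened sample in the same shape as the href from the question.
href = (
    "data:application/csv;charset=utf-8,"
    "%22Date%22%2C%22Weight%20gm%22%0A"
    "%2208-Jan-2018%22%2C%221.32%22%0A"
)
text = decode_data_uri(href)

# Parse the decoded text to confirm it is well-formed CSV.
rows = list(csv.reader(io.StringIO(text)))
print(rows)
```

In the original script, the same helper could be called as decode_data_uri(target_elt["href"]) and the returned text written to data.csv.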