Download csv file using Python from the web
Goal: I want to automate the download of various .csv files from http://www.tocom.or.jp/historical/download.html using Python (this is not the main issue, though).
Specifics: in particular, I am trying to download the csv files under "Tick Data" (the fifth heading from the bottom) for the 5 days available.
Problem: when I view the source code of this web page and search for "Tick Data", I see references to these 5 .csv files, but they do not appear in the usual href attributes. Since I am using Python (urllib), I need to know the URLs of these 5 .csv files, but I don't know how to get them.
This is not a question about Python per se, but about how to find the URL of a .csv file that can be downloaded from a web page. Hence, no code is provided.
The page uses JavaScript to build the URL:
<select name="tick">
<option value="TOCOMprice_20121122.csv">Nov 22, 2012</option>
<option value="TOCOMprice_20121121.csv">Nov 21, 2012</option>
<option value="TOCOMprice_20121120.csv">Nov 20, 2012</option>
<option value="TOCOMprice_20121119.csv">Nov 19, 2012</option>
<option value="TOCOMprice_20121116.csv">Nov 16, 2012</option>
</select>
<input type="button" onClick="location.href='/data/tick/' + document.form.tick.value;"
value="Download" style="width:7em;" />
It combines a path that the browser resolves against the current site, so each URL is:
http://www.tocom.or.jp + /data/tick/ + TOCOMprice_*yearmonthday*.csv
By the looks of it, the data only covers weekdays.
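Rather than guessing dates, the filenames can also be scraped straight out of the select element shown above with the standard library's html.parser (a sketch; in practice you would feed it the downloaded page source rather than the hard-coded snippet used here):

```python
from html.parser import HTMLParser

# Sample of the page source, as shown above
html_source = '''
<select name="tick">
<option value="TOCOMprice_20121122.csv">Nov 22, 2012</option>
<option value="TOCOMprice_20121121.csv">Nov 21, 2012</option>
<option value="TOCOMprice_20121120.csv">Nov 20, 2012</option>
<option value="TOCOMprice_20121119.csv">Nov 19, 2012</option>
<option value="TOCOMprice_20121116.csv">Nov 16, 2012</option>
</select>
'''

class TickOptions(HTMLParser):
    """Collect the value attribute of every <option> inside <select name="tick">."""
    def __init__(self):
        super().__init__()
        self.in_tick = False
        self.files = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'select' and attrs.get('name') == 'tick':
            self.in_tick = True
        elif tag == 'option' and self.in_tick:
            self.files.append(attrs['value'])

    def handle_endtag(self, tag):
        if tag == 'select':
            self.in_tick = False

parser = TickOptions()
parser.feed(html_source)
# parser.files now lists the csv filenames to append to /data/tick/
```

This way the script keeps working even if the site changes which 5 days it offers.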
These are easy enough to cobble together into URLs automatically:
import requests
from datetime import datetime, timedelta

base = 'http://www.tocom.or.jp/data/tick/TOCOMprice_'

day = datetime.now() - timedelta(days=1)  # the latest possible file is yesterday's
for _ in range(5):
    while day.weekday() >= 5:  # Sat = 5, Sun = 6; no files on weekends
        day -= timedelta(days=1)
    r = requests.get(base + day.strftime('%Y%m%d') + '.csv')
    # Save r.content somewhere
    day -= timedelta(days=1)
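The "save somewhere" step could look like the sketch below, which writes the response body to a file named after the date (the helper name and directory layout are illustrative, not from the original answer):

```python
from pathlib import Path

def save_csv(content: bytes, day_str: str, out_dir: str = '.') -> Path:
    """Write downloaded CSV bytes to TOCOMprice_<day_str>.csv under out_dir."""
    path = Path(out_dir) / ('TOCOMprice_%s.csv' % day_str)
    path.write_bytes(content)
    return path

# Inside the loop above, e.g.: save_csv(r.content, '20121122')
```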
I used requests for its easier-to-use API, but you can use urllib2 for this task too if you so wish.
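For reference, the standard-library equivalent might look like this (urllib.request in Python 3; urllib2 in Python 2 has the same shape):

```python
from urllib.request import Request, urlopen

url = 'http://www.tocom.or.jp/data/tick/TOCOMprice_20121122.csv'
req = Request(url)
# data = urlopen(req).read()  # raw CSV bytes, like r.content with requests
```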
Use Chrome with Dev Tools, Firefox with Firebug, or Fiddler to look at the request URL when you hit the download button.
(For example, I see this for Nov 22: http://www.tocom.or.jp/data/tick/TOCOMprice_20121122.csv )
You can determine the download link using, among other things, the developer menu of your browser. I use Chrome, and it shows me that the link is
http://www.tocom.or.jp/data/souba_d/souba_d_20121126_20121123_0425.csv
That URL structure seems straightforward enough to guess, and another link right on the page:
http://www.tocom.or.jp/historical/keishiki_souba_d.html
indicates how to structure the pulls. A good bet is simply to structure the csv pulls at 5-minute intervals.
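Assuming the trailing "0425" in that filename is an HHMM timestamp (an assumption based on the filename alone, not documented by the site), the full 5-minute grid for a day can be generated like this:

```python
# 288 timestamps: '0000', '0005', ..., '2355'
times = ['%02d%02d' % (h, m) for h in range(24) for m in range(0, 60, 5)]
```

Each entry could then be substituted into the URL pattern above when probing for available files.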
Good luck!