
Download csv file using Python from the web

Goal: I want to automate the download of various .csv files from http://www.tocom.or.jp/historical/download.html using Python (this is not the main issue, though).

Specifics: in particular, I am trying to download the csv files for the "Tick Data" section (fifth heading from the bottom), for the 5 days available.

Problem: when I view the source code for this web page and search for "Tick Data", I see references to these 5 .csv files, but they are not in the usual href tags. Since I am using Python (urllib), I need to know the URLs of these 5 .csv files, but I don't know how to get them.

This is not a question about Python per se, but about how to find the URL of a .csv file that can be downloaded from a web page. Hence, no code is provided.

The page uses JavaScript to create the URL:

<select name="tick">
  <option value="TOCOMprice_20121122.csv">Nov 22, 2012</option>
  <option value="TOCOMprice_20121121.csv">Nov 21, 2012</option>
  <option value="TOCOMprice_20121120.csv">Nov 20, 2012</option>
  <option value="TOCOMprice_20121119.csv">Nov 19, 2012</option>
  <option value="TOCOMprice_20121116.csv">Nov 16, 2012</option>
</select>
  <input type="button" onClick="location.href='/data/tick/' + document.form.tick.value;" 
        value="Download" style="width:7em;" />

The button's onClick handler combines a fixed path with the selected filename, which the browser resolves against the current site. So each URL is:

http://www.tocom.or.jp + /data/tick/ + TOCOMprice_*yearmonthday*.csv
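
In code, that is just the fixed prefix plus the selected filename; a minimal sketch of the composition (the filename is one of the <option> values shown above):

from urlparse import urljoin  # Python 3: from urllib.parse import urljoin

site = 'http://www.tocom.or.jp'
path = '/data/tick/'
filename = 'TOCOMprice_20121122.csv'  # one of the <option> values above

url = urljoin(site, path + filename)
# -> http://www.tocom.or.jp/data/tick/TOCOMprice_20121122.csv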

By the looks of it, the data only covers weekdays.

These are easy enough to cobble together into automated URLs:

import requests
from datetime import datetime, timedelta

base = 'http://www.tocom.or.jp/data/tick/TOCOMprice_'

# The files cover the 5 most recent weekdays, so walk backwards from yesterday
day = datetime.now() - timedelta(days=1)
for _ in range(5):
    while day.weekday() >= 5:  # Sat = 5, Sun = 6
        day -= timedelta(days=1)
    r = requests.get(base + day.strftime('%Y%m%d') + '.csv')
    # Save r.content somewhere
    day -= timedelta(days=1)

I used requests for its easier-to-use API, but you can use urllib2 for this task too if you wish.
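
If you would rather not guess the dates (holidays would trip up the date arithmetic above), you can also scrape the filenames straight from the <option> elements on the download page. A minimal sketch, assuming a simple regular expression is enough to pull them out; the regex and the save-to-disk step are illustrative, not taken from the page:

import re
import requests

page = requests.get('http://www.tocom.or.jp/historical/download.html')

# Grab every Tick Data filename from the <option value="..."> entries
filenames = re.findall(r'value="(TOCOMprice_\d{8}\.csv)"', page.text)

for name in filenames:
    r = requests.get('http://www.tocom.or.jp/data/tick/' + name)
    with open(name, 'wb') as f:
        f.write(r.content)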

Use Chrome with Dev Tools, Firefox with Firebug, or Fiddler to look at the request URL when you hit the Download button.

(For example, I see this for Nov 22: http://www.tocom.or.jp/data/tick/TOCOMprice_20121122.csv )
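
Once you have that URL, fetching it is straightforward; for example, with urllib2 (which the question mentions), roughly:

import urllib2

url = 'http://www.tocom.or.jp/data/tick/TOCOMprice_20121122.csv'
data = urllib2.urlopen(url).read()

with open('TOCOMprice_20121122.csv', 'wb') as f:
    f.write(data)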

You can determine the download link using your browser's developer tools, among other things. I use Chrome, and it shows me that the link is

http://www.tocom.or.jp/data/souba_d/souba_d_20121126_20121123_0425.csv

That URL structure seems pretty straightforward to guess, and another link right on the page:

http://www.tocom.or.jp/historical/keishiki_souba_d.html

indicates how the file names are structured. A good bet is just to structure the csv pulls at 5-minute intervals.

Good luck!
