

Python: Downloading a file from a webpage by clicking on a link

I've looked around the internet for a solution to this, but none seemed applicable here. I'm writing a Python program to predict the next day's stock price using historical data. I don't need all the historical data since inception, which is what Yahoo Finance provides, only the last 60 days or so. The NASDAQ website provides just the right amount of historical data, so I want to use that site.

What I want to do is go to a particular stock's profile on NASDAQ, for example www.nasdaq.com/symbol/amd/historical, and click on the "Download this File in Excel Format" link at the very bottom. I inspected the page's HTML to see if there was an actual link I could use with urllib to get the file, but all I got was:

<a id="lnkDownLoad" href="javascript:getQuotes(true);">
                Download this file in Excel Format
</a>

No link. So my question is: how can I write a Python script that goes to a given stock's NASDAQ page, clicks the "Download this file in Excel Format" link, and actually downloads the file? Most solutions online require you to know the URL where the file is stored, but in this case I don't have access to that. So how do I go about doing this?

  1. Using Chrome, go to View > Developer > Developer Tools.
  2. In the developer tools UI, switch to the Network tab.
  3. Navigate to the page with the link you need to click, then click the clear button (the circle-with-a-slash icon) to remove all recent activity.
  4. Click the link and see whether any requests were made to the server.
  5. If there were, click the request and see if you can reverse engineer the API of its endpoint.

Please be aware that this may be against the website's Terms of Service!
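Once the Network tab shows the request behind the link, replaying it from Python could look roughly like the sketch below. The endpoint URL, the form field names (`symbol`, `timeframe`), and the header values are all placeholders I made up for illustration; substitute whatever the real request shows. It uses only the standard library's urllib, since that is what the question mentions.

```python
import urllib.parse
import urllib.request

# Hypothetical values: copy the real URL, method, headers, and form fields
# from the request shown in the Network tab.
ENDPOINT = "https://example.com/quotes/export"  # placeholder, not a real NASDAQ URL


def build_request(symbol, timeframe="3m"):
    """Build (but do not send) a POST request mimicking the one the link triggers."""
    # Field names here are assumptions, not the site's actual parameters.
    form = urllib.parse.urlencode({"symbol": symbol, "timeframe": timeframe})
    return urllib.request.Request(
        ENDPOINT,
        data=form.encode("ascii"),
        headers={
            "User-Agent": "Mozilla/5.0",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def download(request, path):
    """Send the request and save the response body to `path`."""
    with urllib.request.urlopen(request, timeout=30) as response:
        body = response.read()
    with open(path, "wb") as out:
        out.write(body)
    return len(body)
```

Calling `download(build_request("amd"), "amd.csv")` would then save the response, provided the endpoint and fields have been replaced with the real ones.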

It appears that BeautifulSoup might be the easiest way to do this. I've made a cursory check that the results of the following script match those that appear on the page. You would just have to write the results to a file rather than print them. However, the columns are ordered differently.

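import requests
from bs4 import BeautifulSoup

URL = 'http://www.nasdaq.com/symbol/amd/historical'
page = requests.get(URL).text
soup = BeautifulSoup(page, 'lxml')

# The historical quotes table sits inside the div with id "historicalContainer".
tableDiv = soup.find_all('div', id="historicalContainer")
tableRows = tableDiv[0].find_all('tr')

# Skip the first two rows, then print each remaining row as a CSV line.
for tableRow in tableRows[2:]:
    row = tuple(tableRow.getText().split())
    # Wrap the first and last fields in double quotes; the rest are printed bare.
    print('"%s",%s,%s,%s,%s,"%s"' % row)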

In the output, the script wraps dates and thousands-separated numbers in double quotes.
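Writing the rows to a file instead of printing them, and reordering the columns along the way, might be sketched as follows. The column names in `SCRAPED_COLUMNS` are my assumption about the table layout, and the sample row is made up for illustration rather than real quote data. The standard `csv` module takes care of quoting any field that contains a comma.

```python
import csv
import io

# Column order assumed for the scraped rows; adjust to match the page.
SCRAPED_COLUMNS = ["Date", "Open", "High", "Low", "Close", "Volume"]


def write_rows(rows, out, columns=SCRAPED_COLUMNS, order=None):
    """Write scraped rows as CSV, optionally reordering columns.

    `order` is a list of column names in the desired output order.
    """
    order = order or columns
    indexes = [columns.index(name) for name in order]
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(order)
    for row in rows:
        writer.writerow([row[i] for i in indexes])


# Sample row made up for illustration, not a real quote.
sample = [("01/02/2018", "10.42", "11.02", "10.34", "10.98", "44,146,330")]
buffer = io.StringIO()
write_rows(sample, buffer, order=["Date", "Close", "Volume"])
csv_text = buffer.getvalue()
```

To write to disk, pass a file opened with `open("amd.csv", "w", newline="")` in place of the `StringIO` buffer.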

Dig a little deeper and find out what the JavaScript function getQuotes() does. You should get a good clue from that.
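One way to start digging is to pull the function's definition out of the page's script text, whether inline or from a downloaded .js file. The helper below is a rough sketch: `find_js_function` is a name I made up, it relies on simple brace counting, and the sample script is invented, since the real body of getQuotes() is not shown in this thread.

```python
import re


def find_js_function(source, name):
    """Return the text of `function name(...) {...}` from JavaScript source, or None.

    Uses simple brace counting, so braces inside strings or comments can
    confuse it; good enough for a first look.
    """
    match = re.search(r"function\s+" + re.escape(name) + r"\s*\(", source)
    if match is None:
        return None
    open_brace = source.find("{", match.end())
    if open_brace == -1:
        return None
    depth = 0
    for pos in range(open_brace, len(source)):
        if source[pos] == "{":
            depth += 1
        elif source[pos] == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: return the whole definition.
                return source[match.start():pos + 1]
    return None


# Made-up script text for illustration only.
sample_js = (
    "var a = 1; function getQuotes(download) "
    "{ if (download) { submitForm(); } } var b = 2;"
)
snippet = find_js_function(sample_js, "getQuotes")
```

Reading the extracted body should show which form or URL the function submits to, which is the request to reproduce from Python.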

If it all seems too complicated, you can always use Selenium, which automates a real browser. However, it is much slower than making the network calls directly. See the official Selenium documentation for details.
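A minimal Selenium sketch, assuming Chrome with a matching driver is available and the page still contains the `lnkDownLoad` link from the question, could look like this. The function names, the 15-second timeout, and the fixed 10-second wait for the download are my own choices, not anything prescribed by the site or by Selenium.

```python
import time


def historical_url(symbol):
    """Build the historical-quotes URL in the form given in the question."""
    return "http://www.nasdaq.com/symbol/%s/historical" % symbol.lower()


def click_download(symbol, download_dir):
    """Open the page in Chrome and click the download link by its id."""
    # Imported here so historical_url() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    # Chrome preference that sets where downloaded files are saved.
    options.add_experimental_option(
        "prefs", {"download.default_directory": download_dir}
    )
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(historical_url(symbol))
        # Wait until the link from the question's HTML snippet is clickable.
        link = WebDriverWait(driver, 15).until(
            EC.element_to_be_clickable((By.ID, "lnkDownLoad"))
        )
        link.click()
        # Crude fixed wait so the download can finish before the browser closes.
        time.sleep(10)
    finally:
        driver.quit()
```

Calling `click_download("amd", "/tmp/quotes")` would leave the file in that directory if the click triggers a normal browser download.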

