
Python: How to download Excel file from page

  1. Go to this URL: https://www.horseracebase.com/horse-racing-results.php?year=2005&month=3&day=15 (username = TrickyBen | password = TrickyBen123)
  2. Notice that there is a Download Excel button (in red).
  3. I want to download the Excel file and turn it into a pandas DataFrame. I want to do it programmatically (i.e. from the script, not by manually clicking around the website). How would I do this?

This code will get you logged in as TrickyBen, and make a request to the website API...

import requests
from lxml import html
from requests import Session
import pandas as pd
import shutil

raceSession = Session()

LoginDetails = {'login': 'TrickyBen', 'password': 'TrickyBen123'}

LoginUrl = 'https://www.horseracebase.com/horse-racing-results.php?year=2005&month=3&day=15/horsebase1.php'
LoginPost = raceSession.post(LoginUrl, data=LoginDetails)

RaceUrl = 'https://www.horseracebase.com/excelresults.php'
RaceDataDetails =  {"user": "41495", "racedate": "2005-3-15", "downloadbutton": "Excel"}

PostHeaders = {"Content-Type": "application/x-www-form-urlencoded"}
Response = raceSession.post(RaceUrl, data=RaceDataDetails, headers=PostHeaders)

Table = pd.read_table(Response.text)

Table.to_csv('blahblah.csv')

If you inspect the element, you'll notice that the relevant element looks like this...

<form action="excelresults.php" method="post">
    <input type="hidden" name="user" value="41495">
    <input type="hidden" name="racedate" value="2005-3-15">
    <input type="submit" class="downloadbutton" value="Excel">
</form>

I get this error message...

Traceback (most recent call last):
  File "/Users/Alex/Desktop/DateTest/hrpull.py", line 20, in <module>
    Table = pd.read_table(Response.text)
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 315, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 645, in __init__
    self._make_engine(self.engine)
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 799, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 1213, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 358, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3427)
  File "pandas/parser.pyx", line 628, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6861)
IOError: File race_date race_time   track   race_name       race_restrictions_age   race_class  major   race_distance   prize_money     going_description   number_of_runners   place   distbt  horse_name  stall       trainer horse_age   jockey_name jockeys_claim   pounds  odds    fav     official_rating comptime    TotalDstBt  MedianOR    Dist_Furlongs       placing_numerical   RCode   BFSP    BFSP_Place  PlcsPaid    BFPlcsPaid      Yards   RailMove    RaceType    
"2005-03-15"    "14:00:00"  "Cheltenham"    "Letheby & Christopher Supreme Novices Hurdle " "4yo+"  "Class 1"   "Grade 1"   "2m˝f " "58000" "Good"  "20"    "1st"       "Arcalis"   "0" "Johnson, J Howard" "5" "Lee, G"    "0" "161"   "21"        "136"   "3 mins 53.00s"     "121.5" "16.5"  "1" "National Hunt" "0" "0" "3" "0" "0" "0" "Novices Hurdle"
"2005-03-15"    "14:00:00"  "Cheltenham"    "Letheby & Christopher Supreme Novices Hurdle " "4yo+"  "Class 1"   "Grade 1"   "2m˝f " "58000" "Good"  "20"    "2nd"   "6" "Wild Passion (GER)"    "0" "Meade, Noel"   "5" "Carberry, P"   "0" "161"   "11"        "0" "3 mins 53.00s" "6" "121.5" "16.5"  "2" "National Hunt" "0" "0" "3" "0" "0" "0" "Novices Hurdle"

I'm thinking that you can see the data that you want to download in another web page, for example, by clicking on "My Systems (v4)". If you can do that, then you can use urllib.request.urlretrieve to download that page, and then use html.parser.HTMLParser to parse the data and process it as you wish.
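As a rough sketch of what HTMLParser-based parsing might look like, the snippet below collects the hidden input fields from the form shown in the question. The class name HiddenInputParser and the variable names are made up for illustration, and the markup is pasted in as a string; a real script would feed in the downloaded page instead, and the same pattern could be adapted to table cells.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs from <input type="hidden"> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr_map = dict(attrs)
        # Only keep hidden inputs that carry a name, since those are submitted with the form.
        if attr_map.get("type") == "hidden" and "name" in attr_map:
            self.fields[attr_map["name"]] = attr_map.get("value", "")


# Form markup copied from the question; a real script would feed the page HTML here.
FORM_HTML = """
<form action="excelresults.php" method="post">
    <input type="hidden" name="user" value="41495">
    <input type="hidden" name="racedate" value="2005-3-15">
    <input type="submit" class="downloadbutton" value="Excel">
</form>
"""

parser = HiddenInputParser()
parser.feed(FORM_HTML)
hidden_fields = parser.fields
print(hidden_fields)
```

Reading the fields from the page this way avoids hard-coding the user value, which may differ between accounts.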

If you look at the API called by the form action, you'll see that you have to make a POST request to this URL:

https://www.horseracebase.com/excelresults.php

with the following parameters:

data = {
    "user": "41495", # looks like this varies with login, so update in case you change your login id
    "racedate": "2005-3-15",
    "downloadbutton": "Excel"
}

You can do something like this:

reqUrl = 'https://www.horseracebase.com/excelresults.php'
response = raceSession.post(reqUrl, data=data)  # data= sends the fields form-encoded

If this doesn't work, try adding headers to the request, like: headers=postHeaders. For example, you should set the content type header in this case, as you're sending form-encoded data, so:

headers = {"Content-Type": "application/x-www-form-urlencoded"} 
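To illustrate what "form-encoded" means here, the following sketch uses only the standard library to compare a form-encoded body with a JSON body for the same parameters. It approximates what an HTTP client would send and is not a trace of any particular library's internals; the variable names are illustrative.

```python
import json
from urllib.parse import urlencode

# Same parameters as in the answer above.
data = {
    "user": "41495",
    "racedate": "2005-3-15",
    "downloadbutton": "Excel",
}

# Body that matches Content-Type: application/x-www-form-urlencoded
form_body = urlencode(data)

# Body that matches Content-Type: application/json (not what an HTML form posts)
json_body = json.dumps(data)

print(form_body)
print(json_body)
```

A server-side script that reads ordinary form fields generally expects the first shape, which is why the body encoding and the Content-Type header should agree.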

Read this for more info on how to save the Excel file to disk.
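One common way to save a downloaded body to disk, as a minimal sketch: write the raw bytes in binary mode. The helper name save_download, the file name results.xls, and the stand-in bytes are assumptions for illustration; in the real script the bytes would come from the response's content attribute.

```python
import tempfile
from pathlib import Path


def save_download(content: bytes, destination: Path) -> Path:
    """Write raw response bytes to disk and return the path written."""
    destination.write_bytes(content)
    return destination


# Stand-in bytes so the sketch runs without network access; in the real
# script this would be Response.content from the POST request.
fake_content = b"race_date\ttrack\n2005-03-15\tCheltenham\n"
target = Path(tempfile.mkdtemp()) / "results.xls"
saved = save_download(fake_content, target)
print(saved, saved.stat().st_size)
```

Writing bytes rather than decoded text keeps the file identical to what the server sent, whatever its encoding.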

Here's the response for this request in Postman, so it looks like you won't need any additional headers except the content-type:

[Image: Postman response]

EDIT

This is what you need to do:

from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO

import pandas as pd
from requests import Session

raceSession = Session()

RaceUrl = 'https://www.horseracebase.com/excelresults.php'
RaceDataDetails = {"user": "41495", "racedate": "2005-3-15", "downloadbutton": "Excel"}

PostHeaders = {"Content-Type": "application/x-www-form-urlencoded"}
Response = raceSession.post(RaceUrl, data=RaceDataDetails, headers=PostHeaders)

# read_table treats a plain string as a file path, so wrap the text in a file-like object
Table = pd.read_table(StringIO(Response.text))
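The StringIO step can be tried without touching the website. The sketch below parses a shortened, hand-made sample in the same tab-separated, double-quoted shape as the text shown in the question's error message; the three columns are a subset chosen for illustration.

```python
from io import StringIO

import pandas as pd

# Shortened stand-in for Response.text: tab-separated with quoted values.
sample_text = (
    "race_date\ttrack\thorse_name\n"
    '"2005-03-15"\t"Cheltenham"\t"Arcalis"\n'
    '"2005-03-15"\t"Cheltenham"\t"Wild Passion (GER)"\n'
)

# StringIO gives pandas a file-like object, so the text is parsed as data
# instead of being interpreted as a file path.
table = pd.read_table(StringIO(sample_text))
print(table)
```

Passing the bare string instead of the StringIO wrapper is what triggered the IOError in the question, because pandas tried to open a file with that name.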
