
How do I use Python to trigger the download of a file from a website?

I'm trying to set up a script to pull data from a website on a daily basis, and I'm having trouble getting Python to actually read the table (I'm not a professional coder). I've tried two methods:

1) Scraping the table (headers, rows, etc.) with Beautiful Soup, and

2) Using the website's "export with excel" button

Here is the precise website: https://scgenvoy.sempra.com/index.html#nav=/Public/ViewExternalLowOFO.getLowOFO%3Frand%3D200

So far my code is:

# Imports
import requests
from bs4 import BeautifulSoup  # used for method 1 (scraping the table)
import pandas as pd

URL = 'https://scgenvoy.sempra.com/index.html#nav=/Public/ViewExternalLowOFO.getLowOFO%3Frand%3D200'

# Create a handle, page, for the contents of the website
# (the site's certificate fails verification, so it is disabled here)
requests.packages.urllib3.disable_warnings()
page = requests.get(URL, verify=False)

I think the easiest method would be to trigger the "export" function via its XPath:

//*[@id="content"]/form/div[2]/div/table/tbody/tr/td[4]/table/tbody/tr/td[1]/a

All help is greatly appreciated!

I would try to identify the API behind the "export to excel" button and call that API directly. You can find it in your browser's developer tools. For example, here's what Chrome's "Copy as cURL" gives:

curl 'https://scgenvoy.sempra.com/Public/ViewExternalLowOFO.submitLowOfoSaveAs' -H 'Connection: keep-alive' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache' -H 'Origin: https://scgenvoy.sempra.com' -H 'Upgrade-Insecure-Requests: 1' -H 'DNT: 1' -H 'Content-Type: application/x-www-form-urlencoded' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3' -H 'Referer: https://scgenvoy.sempra.com/index.html' -H 'Accept-Encoding: gzip, deflate, br' -H 'Accept-Language: en-US,en;q=0.9' -H 'Cookie: FAROFFSession=537EB1587E4A063416D5F2206890A2B6.managed2' --data 'FileName=LowOFO05302019Cycle2&Class=com.sempra.krypton.common.saveas.constants.FancyExcelExportType&pageSize=letter&pageOrientation=portrait&HiddenGasFlowDateField=05%2F30%2F2019&HiddenCycleField=2&gasFlowDate=05%2F30%2F2019&cycle=2' --compressed 

The API URL is https://scgenvoy.sempra.com/Public/ViewExternalLowOFO.submitLowOfoSaveAs

and the input parameters are:

FileName: LowOFO05302019Cycle2
Class: com.sempra.krypton.common.saveas.constants.FancyExcelExportType
pageSize: letter
pageOrientation: portrait
HiddenGasFlowDateField: 05/30/2019
HiddenCycleField: 2
gasFlowDate: 05/30/2019
cycle: 2

and the request method is POST.

You can now use the Python requests library to make this POST request, passing appropriate values for each parameter. (BeautifulSoup only parses HTML; it doesn't make requests.)
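A minimal sketch of that request with requests, based on the parameters in the cURL capture above. The `FileName` pattern (`LowOFO<MMDDYYYY>Cycle<n>`) is inferred from the one captured example, and whether the endpoint also requires a session cookie is not verified here:

```python
import datetime

# API endpoint captured from the browser's "Copy as cURL" (see above)
EXPORT_URL = 'https://scgenvoy.sempra.com/Public/ViewExternalLowOFO.submitLowOfoSaveAs'

def build_payload(gas_flow_date, cycle):
    """Build the form payload for one gas-flow date and cycle.

    The FileName pattern is inferred from the captured request and
    may need adjusting.
    """
    date_str = gas_flow_date.strftime('%m/%d/%Y')
    return {
        'FileName': 'LowOFO{}Cycle{}'.format(gas_flow_date.strftime('%m%d%Y'), cycle),
        'Class': 'com.sempra.krypton.common.saveas.constants.FancyExcelExportType',
        'pageSize': 'letter',
        'pageOrientation': 'portrait',
        'HiddenGasFlowDateField': date_str,
        'HiddenCycleField': str(cycle),
        'gasFlowDate': date_str,
        'cycle': str(cycle),
    }

def download_export(gas_flow_date, cycle, out_path):
    """POST the form and save the returned Excel file (not run here)."""
    import requests
    resp = requests.post(EXPORT_URL, data=build_payload(gas_flow_date, cycle), verify=False)
    resp.raise_for_status()
    with open(out_path, 'wb') as f:
        f.write(resp.content)

payload = build_payload(datetime.date(2019, 5, 30), 2)
print(payload['FileName'])  # LowOFO05302019Cycle2
```

Calling `download_export(datetime.date(2019, 5, 30), 2, 'lowofo.xls')` would then reproduce the captured request for that date, assuming no session cookie is needed.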

Providing an idea here rather than a complete solution.

The website builds the table (and the export button) dynamically, so you need the Selenium package to drive a real browser. Download the Selenium WebDriver that matches your browser.

For the Chrome browser:

http://chromedriver.chromium.org/downloads

Install the ChromeDriver binary (Linux example):

unzip ~/Downloads/chromedriver_linux64.zip -d ~/Downloads
chmod +x ~/Downloads/chromedriver
sudo mv -f ~/Downloads/chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver

Selenium tutorial:

https://selenium-python.readthedocs.io/

Export the Excel file:

from selenium import webdriver
import time

driver = webdriver.Chrome('/usr/bin/chromedriver')  # path to the ChromeDriver binary
driver.get('https://scgenvoy.sempra.com/index.html#nav=/Public/ViewExternalLowOFO.getLowOFO%3Frand%3D200')
time.sleep(3)  # wait for the dynamic table to load
excel_button = driver.find_element_by_xpath("//div[@id='content']/form/div[2]/div/table/tbody/tr/td[4]/table/tbody/tr/td[2]/a")
excel_button.click()  # click() returns None, so there is nothing useful to print

where '/usr/bin/chromedriver' is the ChromeDriver path.
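One practical wrinkle: clicking the export link makes Chrome save the file to its default download directory. If you want the file in a known folder, you can pass download preferences when creating the driver. A sketch of the prefs dict (these are standard Chrome preference keys; the directory path is a placeholder):

```python
def chrome_download_prefs(download_dir):
    """Chrome preferences that send downloads to download_dir
    without the save-as prompt."""
    return {
        'download.default_directory': download_dir,
        'download.prompt_for_download': False,
        'download.directory_upgrade': True,
    }

# Usage with Selenium (not run here):
#   from selenium import webdriver
#   options = webdriver.ChromeOptions()
#   options.add_experimental_option('prefs', chrome_download_prefs('/home/me/ofo'))
#   driver = webdriver.Chrome('/usr/bin/chromedriver', options=options)

print(chrome_download_prefs('/tmp/ofo')['download.default_directory'])  # /tmp/ofo
```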

Here is the code I got working:

import pandas as pd
from selenium import webdriver
import time

## Input parameters
start_date = '5/28/19'
end_date = '5/31/19'

## Date range to loop through
# (pd.date_range takes no dtype argument; it already returns datetimes)
datelist = pd.date_range(start=start_date, end=end_date, freq='D')
print(datelist)

# Open Chrome and load Gas Envoy
driver = webdriver.Chrome('C:/Users/tmrt/Documents/chromedriver_win32/chromedriver.exe')
driver.get('https://scgenvoy.sempra.com/index.html#nav=/Public/ViewExternalLowOFO.getLowOFO%3Frand%3D200')

# Pause to give the page time to load
time.sleep(5)

# Loop through the dates
for d in datelist:
    # Find the date box and its "go" button
    date_box = driver.find_element_by_xpath('//*[@id="content"]/form/div[2]/table/tbody/tr/td[1]/table/tbody/tr/td[2]/input')
    date_clicker = driver.find_element_by_xpath('//*[@id="content"]/form/div[2]/table/tbody/tr/td[2]/table/tbody/tr/td/a')

    # Enter the date into the date box
    date_box.clear()
    date_box.send_keys(d.strftime("%m/%d/%Y"))

    # Submit the date
    date_clicker.click()

    # Pause to allow the table to reload
    time.sleep(5)

    # Click the download button
    csv_button = driver.find_element_by_xpath('//*[@id="content"]/form/div[2]/div/table/tbody/tr/td[4]/table/tbody/tr/td[1]/a')
    csv_button.click()

driver.close()
