
Python Selenium - Scraping a Table from a Dynamic Page

I'm completely new to Python. I want to scrape data from an HTML table and put it into MS Excel. The website I'm scraping from is dynamic, so I have to select options from 3 drop-down boxes to build the table.

Please note that the code below gets me to the website and selects the options I need to build the table.

Please note that the URL of this site does not change; it stays the same as the drop-down options are selected.

This is what the table looks like once I select the options I need:

[screenshot: Table]

Here is a sample of the html for the table:

[screenshot: Sample HTML of Table]

My question is on how to read the table with Python and bring the header and contents of the table neatly into MS Excel. The preference would be to maintain the formatting (the font, alternating colors, etc) if possible, but that's not super important.

This is the code I'm using to go to the website and select the options I need from the drop down boxes:

from selenium import webdriver
from selenium.webdriver.support.select import Select
import datetime
import time
from pandas.tseries.offsets import BDay

DRIVER_PATH = 'path to chrome driver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)

# Open the page
driver.get('url of web page')

# Select drop-down box 1 option
select = Select(driver.find_element_by_id('cboGroup'))
select.select_by_visible_text('Drop down box 1 option')

# Wait for the page to update (a fixed sleep; an explicit WebDriverWait would be more robust)
time.sleep(1)

# Select drop-down box 2 option
select = Select(driver.find_element_by_id('cboElements'))
select.select_by_visible_text('Drop down box 2 option')
time.sleep(1)

# Select drop-down box 3 option: the previous business day's date
ReportDate = datetime.datetime.today() - BDay(1)
NewReportDate = ReportDate.strftime("%m/%d/%Y")
print(NewReportDate)
select = Select(driver.find_element_by_id('cboDelDate'))
select.select_by_visible_text(NewReportDate)
time.sleep(1)

I've tried using the send_keys command to copy the whole page and paste it into MS Excel (Ctrl+A, Ctrl+C, then Ctrl+V in Excel), but the formatting gets thrown off and it doesn't look right.

I've also tried using Pandas, but I haven't been able to grab the table data.

In order to select the table you can use its unique id, DataGrid1. You can do this with the following code snippet:

table = driver.find_element_by_xpath("//div[@id='DataGrid1']")

Now you have the table element. After that, you need to go row by row. You can do this by finding all the tr (row) elements in the table like this:

table_rows = table.find_elements_by_xpath(".//tr")

Note: the leading dot in ".//tr" restricts the search to tr elements within the table element. This way you get only that table's rows and not tr elements from the whole page.

Once you have all the row elements, I suggest you create an empty list for each column and append each cell's text at the correct place, for example:

list_column_0 = []
list_column_1 = []
......

for current_row in table_rows:
    column_elements = current_row.find_elements_by_xpath(".//td")
    list_column_0.append(column_elements[0].text)
    list_column_1.append(column_elements[1].text)
    .......

In order to save the data in an MS Excel Sheet you need to use a pandas DataFrame object and the easiest way (in my opinion) is to create it via a dictionary:

dict_output = {
    "Name of column 0":list_column_0,
    "Name of column 1":list_column_1,
......
}

import pandas as pd   # you can put this at the top of your code
df = pd.DataFrame.from_dict(dict_output)
df.to_excel("output_file.xlsx", index=False)

To clarify: in order to create a DataFrame object, all the lists in the dictionary (list_column_0, list_column_1, etc.) must be the same length. When you specify index=False, the output does not include an extra column of row index numbers.
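Putting the dictionary-to-Excel step together, here is a minimal runnable sketch; the sample values below are made up and stand in for the cell text that the row loop above would collect:

```python
import pandas as pd

# Hypothetical sample data standing in for the scraped column text
list_column_0 = ["Alpha", "Beta"]
list_column_1 = ["1.25", "3.50"]

dict_output = {
    "Name of column 0": list_column_0,
    "Name of column 1": list_column_1,
}

# All lists must be the same length, or the constructor raises an error
df = pd.DataFrame.from_dict(dict_output)

# index=False suppresses the extra index column in the spreadsheet
df.to_excel("output_file.xlsx", index=False)
```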

Instead of copying and pasting the content, I used the find_all function from the BeautifulSoup library to locate the table. Then I used Pandas to create a DataFrame from the table and write it to my Excel sheet.

Here is the code I used:

from bs4 import BeautifulSoup
import pandas as pd

html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")[1]   # the second table on the page
df = pd.read_html(str(table))[0]
df.to_excel("<PATH TO EXCEL SHEET>", sheet_name="<SHEET NAME>", index=False, header=False)
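As an aside, pandas.read_html can also parse HTML directly, so the BeautifulSoup lookup can be skipped if the table can be identified by an attribute. A minimal sketch, with an inline snippet standing in for driver.page_source; filtering on id="DataGrid1" is an assumption based on the id mentioned in the other answer:

```python
from io import StringIO
import pandas as pd

# Inline HTML standing in for driver.page_source (hypothetical sample data)
html = """
<table id="DataGrid1">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Alpha</td><td>1.25</td></tr>
  <tr><td>Beta</td><td>3.50</td></tr>
</table>
"""

# attrs narrows the search to the table whose id is DataGrid1
tables = pd.read_html(StringIO(html), attrs={"id": "DataGrid1"})
df = tables[0]
print(df)
```

read_html returns a list of DataFrames, one per matching table, which is why the result is indexed with [0].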
