Extracting Data from a Table in HTML using Selenium and Python

Question

I have an assignment to extract certain items from each row of an HTML table. I have figured out how to grab the whole table from the page using Selenium with Python. This is the code for that:

from selenium import webdriver
import time 
import pandas as pd

mydriver = webdriver.Chrome('C:/Program Files/chromedriver.exe')
mydriver.get("https://www.bseindia.com/corporates/ann.aspx?expandable=0")

time.sleep(5)  # wait 5 seconds for the DOM to load completely
table = mydriver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table/tbody')

for row in table.find_elements_by_xpath('./tr'):
    print(row.text)

I don't understand how to grab specific items from the table itself. These are the items I need:

Company Name
PDF Link (if it does not exist, write "No PDF Link")
Received Time
Disseminated Time
Time Taken
Description

Any help with the logic would be appreciated. Thanks in advance.

Answer 1

for tr in mydriver.find_elements_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table//tr'):
    tds = tr.find_elements_by_tag_name('td')
    print([td.text for td in tds])
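
The loop above yields one list of cell texts per table row. If each announcement occupies a fixed number of consecutive rows (Answer 2 below steps through the rows four at a time, which suggests this), those lists could be grouped into one record per announcement. The following is only a minimal sketch under that assumption: `group_rows`, `rows_per_record`, and the sample data are hypothetical and not taken from the actual page.

```python
def group_rows(rows, rows_per_record=4):
    """Split a flat list of rows into consecutive groups of equal size."""
    # Ignore a trailing partial group so every record has the same shape.
    usable = len(rows) - len(rows) % rows_per_record
    return [rows[i:i + rows_per_record] for i in range(0, usable, rows_per_record)]


# Hypothetical cell texts, shaped like the lists printed by the loop above.
sample_rows = [
    ["Example Company Ltd - 500001 - Board Meeting"],
    ["Received 10:00:00", "Disseminated 10:02:00"],
    ["Description of the first announcement"],
    [],
    ["Another Company Ltd - 500002 - Results"],
    ["Disseminated 11:00:00"],
    ["Description of the second announcement"],
    [],
    ["Trailing partial row"],
]

records = group_rows(sample_rows)
for name_row, time_row, description_row, _ in records:
    print(name_row, time_row, description_row)
```

With real data, `rows` would be the list built from `[td.text for td in tds]` for every `tr`, and the group size would need to be checked against the page's actual layout.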

Answer 2

I had a rough time getting this to work, but I think it works fine now. It's pretty inefficient, though. Here is the code:

from selenium import webdriver
import time
import pandas as pd
from selenium.common.exceptions import NoSuchElementException

mydriver = webdriver.Chrome('C:/Program Files/chromedriver.exe')
mydriver.get("https://www.bseindia.com/corporates/ann.aspx?expandable=0")
time.sleep(5)  # wait 5 seconds for the DOM to load completely

trs = mydriver.find_elements_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table/tbody/tr')
del trs[0]  # skip the first row

names = []
r_time = []
d_time = []
t_taken = []
desc = []
pdfs = []
codes = []

# The loop advances four rows at a time: row i holds the name,
# row i + 1 the times, and row i + 2 the description.
for i in range(0, len(trs) - 2, 4):
    names.append(trs[i].text)

    # Company codes are the integer tokens above 100000 in the name row
    for item in trs[i].text.split():
        try:
            code = int(item)
        except ValueError:
            continue  # not a number, so not a company code
        if code > 100000:
            codes.append(code)

    # Check every fourth cell, starting at index 2, for a PDF link
    cells = trs[i].find_elements_by_tag_name('td')
    for pdf_index in range(2, len(cells), 4):
        try:
            pdf = cells[pdf_index].find_element_by_tag_name('a')
            pdfs.append(pdf.get_attribute('href'))
        except NoSuchElementException:
            pdfs.append("No PDF")

    # Named time_parts so that it does not shadow the time module
    time_parts = trs[i + 1].text.split()
    if len(time_parts) == 5:
        # Only the disseminated time is present
        r_time.append("No Time Given")
        d_time.append(time_parts[3] + " " + time_parts[4])
        t_taken.append("No Time Given")
    else:
        r_time.append(time_parts[3] + " " + time_parts[4])
        d_time.append(time_parts[8] + " " + time_parts[9])
        t_taken.append(time_parts[12])

    desc.append(trs[i + 2].text)

df = pd.DataFrame.from_dict({'Name': names, 'Description': desc, 'PDF Link': pdfs, 'Company Code': codes, 'Received Time': r_time, 'Disseminated Time': d_time, 'Time Taken': t_taken})
df.to_excel('corporate.xlsx', header=True, index=False)  # write the data to an Excel sheet

I also added one more item that was asked for: the company code, which goes in its own column. The result is written to corporate.xlsx.
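
Picking tokens by position from the split time row breaks as soon as the row has a different number of words. One possible alternative is to search for each label and take the value that follows it. This is only a sketch: it assumes the row text contains the labels "Received Time", "Disseminated Time" and "Time Taken" (the wording is taken from the field list in the question, not verified against the page), each followed by its value. The helper `parse_times` and the sample strings are hypothetical.

```python
import re

# Assumed label wording; date-times are taken as two whitespace-separated
# tokens and the duration as one, mirroring what the answer's code extracts.
PATTERNS = {
    "Received Time": re.compile(r"Received Time\s+(\S+\s+\S+)"),
    "Disseminated Time": re.compile(r"Disseminated Time\s+(\S+\s+\S+)"),
    "Time Taken": re.compile(r"Time Taken\s+(\S+)"),
}


def parse_times(text, missing="No Time Given"):
    """Return a dict with one value per label, or `missing` if the label is absent."""
    result = {}
    for label, pattern in PATTERNS.items():
        match = pattern.search(text)
        # Collapse internal whitespace so the value reads "date time".
        result[label] = " ".join(match.group(1).split()) if match else missing
    return result


# Hypothetical row texts, not copied from the real page.
full_row = ("Exchange Received Time 18/06/2018 16:50:00 "
            "Exchange Disseminated Time 18/06/2018 16:52:09 "
            "Time Taken 00:02:09")
short_row = "Exchange Disseminated Time 18/06/2018 16:52:09"

print(parse_times(full_row))
print(parse_times(short_row))
```

Because every call returns all three keys, the three time lists always stay the same length, which keeps the DataFrame columns aligned.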

solution1
1 2018-06-18 16:52:09

solution2
0