I have an assignment to extract some items from each row of an HTML table. I have figured out how to grab the whole table from the web using Selenium with Python. Here is the code for that:
from selenium import webdriver
import time
import pandas as pd
mydriver = webdriver.Chrome('C:/Program Files/chromedriver.exe')
mydriver.get("https://www.bseindia.com/corporates/ann.aspx?expandable=0")
time.sleep(5)  # wait 5 seconds for the DOM to load completely
table = mydriver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table/tbody')
for row in table.find_elements_by_xpath('./tr'):
    print(row.text)
I cannot work out how to grab specific items from the table itself. These are the items I need:
Company Name
PDF Link (if it does not exist, write "No PDF Link")
Received Time
Disseminated Time
Time Taken
Description
Any help with the logic would be appreciated. Thanks in advance.
for tr in mydriver.find_elements_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table//tr'):
    tds = tr.find_elements_by_tag_name('td')
    print([td.text for td in tds])
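Each list printed above is the cell text of one table row. If every announcement occupies a fixed number of consecutive rows, the rows can be cleaned and grouped in plain Python before picking out individual fields. The following is only a rough sketch: the group size of 4, the helper names `clean_cells` and `group_rows`, and the sample rows are assumptions made for illustration, not something read from the live page.

```python
def clean_cells(cells):
    """Strip whitespace from each cell and drop cells that end up empty."""
    return [c.strip() for c in cells if c and c.strip()]


def group_rows(rows, size=4):
    """Split a flat list of rows into consecutive groups of `size` rows.

    A trailing partial group is dropped, since it cannot be a full record.
    """
    usable = len(rows) - len(rows) % size
    return [rows[i:i + size] for i in range(0, usable, size)]


# Hypothetical sample standing in for [td.text for td in tds] of each row.
sample_rows = [
    ["Example Ltd - 500001 - Board Meeting", " ", "Update"],
    ["Exchange Received Time 01-01-2019 10:00:00"],
    ["Outcome of the board meeting"],
    [""],
    ["Another Ltd - 500002 - Results", "", "Result"],
    ["Exchange Received Time 01-01-2019 11:00:00"],
    ["Financial results"],
    [""],
    ["Dangling row"],
]

records = group_rows([clean_cells(r) for r in sample_rows])
for record in records:
    print(record)
```

Each entry of `records` then holds the rows of one announcement, so the name, timing and description rows can be addressed by position inside the group rather than by a running index over the whole table.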
I had a rough time getting this to work, but I think it works fine now. It's pretty inefficient, though. Here is the code:
from selenium import webdriver
import time
import pandas as pd
from selenium.common.exceptions import NoSuchElementException
mydriver = webdriver.Chrome('C:/Program Files/chromedriver.exe')
mydriver.get("https://www.bseindia.com/corporates/ann.aspx?expandable=0")
time.sleep(5)  # wait 5 seconds for the DOM to load completely
trs = mydriver.find_elements_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table/tbody/tr')
del trs[0]
names = []
r_time = []
d_time = []
t_taken = []
desc = []
pdfs = []
codes = []
i = 0
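while i < len(trs):
    # Row i holds the announcement heading (company name, code, PDF link).
    names.append(trs[i].text)
    l = trs[i].text.split()
    for item in l:
        try:
            code = int(item)
            if code > 100000:
                codes.append(code)
        except ValueError:
            # Not a number, so it cannot be the company code.
            pass
    link = trs[i].find_elements_by_tag_name('td')
    pdf_count = 2
    while pdf_count < len(link):
        try:
            pdf = link[pdf_count].find_element_by_tag_name('a')
            pdfs.append(pdf.get_attribute('href'))
        except NoSuchElementException:
            pdfs.append("No PDF")
        pdf_count = pdf_count + 4
    # Row i + 1 holds the timing text; renamed so it no longer shadows the time module.
    time_parts = trs[i + 1].text.split()
    if len(time_parts) == 5:
        # Only the disseminated time is present.
        r_time.append("No Time Given")
        d_time.append(time_parts[3] + " " + time_parts[4])
        t_taken.append("No Time Given")
    else:
        r_time.append(time_parts[3] + " " + time_parts[4])
        d_time.append(time_parts[8] + " " + time_parts[9])
        t_taken.append(time_parts[12])
    # Row i + 2 holds the description; each announcement spans 4 rows.
    desc.append(trs[i + 2].text)
    i = i + 4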
df = pd.DataFrame.from_dict({'Name':names,'Description':desc, 'PDF Link' : pdfs,'Company Code' : codes, 'Received Time' : r_time, 'Disseminated Time' : d_time, 'Time Taken' : t_taken})
df.to_excel('corporate.xlsx', header=True, index=False)  # write the data to an Excel sheet
I have also added one more thing that was asked for: the company code goes into a separate column as well. That's the result I get.
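Since the code above admits to being inefficient and relies on fixed word positions, here is a hedged sketch of how the timing row and the company code might be pulled out with regular expressions instead. It assumes the timing text looks like "Exchange Received Time <date> <time> Exchange Disseminated Time <date> <time> Time Taken <duration>", which is only inferred from the indexes used above, and that the company code is a standalone six-digit number. The function names `parse_times` and `find_company_code` and the sample strings are made up for illustration.

```python
import re

# Assumed layout, inferred from the word positions used in the code above.
RECEIVED_RE = re.compile(r"Received Time\s+(\d\S*\s+\d\S*)")
DISSEMINATED_RE = re.compile(r"Disseminated Time\s+(\d\S*\s+\d\S*)")
TAKEN_RE = re.compile(r"Time Taken\s+(\d\S*)")
# Assumption: the company code is a standalone six-digit number.
CODE_RE = re.compile(r"\b(\d{6})\b")

MISSING = "No Time Given"


def _first_group(pattern, text):
    """Return the first captured group with whitespace normalised, or MISSING."""
    match = pattern.search(text)
    return " ".join(match.group(1).split()) if match else MISSING


def parse_times(text):
    """Extract the three timing fields from the text of a timing row."""
    return {
        "Received Time": _first_group(RECEIVED_RE, text),
        "Disseminated Time": _first_group(DISSEMINATED_RE, text),
        "Time Taken": _first_group(TAKEN_RE, text),
    }


def find_company_code(text):
    """Return the first six-digit number in the text as an int, or None."""
    match = CODE_RE.search(text)
    return int(match.group(1)) if match else None


# Hypothetical sample strings, not copied from the site.
full_row = ("Exchange Received Time 01-01-2019 10:00:00 "
            "Exchange Disseminated Time 01-01-2019 10:00:05 "
            "Time Taken 00:00:05")
partial_row = "Exchange Disseminated Time 01-01-2019 10:00:05"

print(parse_times(full_row))
print(parse_times(partial_row))
print(find_company_code("Example Ltd - 500001 - Board Meeting"))
```

Matching on the labels rather than on word positions means a missing field falls back to "No Time Given" on its own, without a separate length check, and returning None for a missing code makes it easier to keep every column the same length.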