Extracting Data from a Table in HTML using Selenium and Python

I have an assignment to extract certain items from each row of an HTML table. I have figured out how to grab the whole table from the page using Selenium with Python. Here is the code for that:

from selenium import webdriver
import time 
import pandas as pd

mydriver = webdriver.Chrome('C:/Program Files/chromedriver.exe')
mydriver.get("https://www.bseindia.com/corporates/ann.aspx?expandable=0")

time.sleep(5)  # wait 5 seconds for the DOM to finish loading
table = mydriver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table/tbody')

for row in table.find_elements_by_xpath('./tr'):
    print(row.text)

I don't understand how to grab specific items from the table itself. These are the items I need:

  1. Company Name

  2. PDF Link (if it does not exist, write "No PDF Link")

  3. Received Time

  4. Disseminated Time

  5. Time Taken

  6. Description

Any help with the logic would be appreciated. Thanks in advance.

# Walk every row of the announcements table and print its cell texts as a list.
for tr in mydriver.find_elements_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table//tr'):
    tds = tr.find_elements_by_tag_name('td')
    print([td.text for td in tds])
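
The "No PDF Link" requirement is not covered by printing cell text alone, so here is a minimal sketch of that fallback logic. It deliberately swaps in a different technique: instead of Selenium lookups it parses an HTML string with the standard library's `html.parser`, on the assumption that the table's markup has been obtained first (for example via the element's `get_attribute('outerHTML')`). `RowLinkParser` and `SAMPLE_HTML` are hypothetical names and data made up for illustration; the real page's markup may differ.

```python
from html.parser import HTMLParser


class RowLinkParser(HTMLParser):
    """Collect, for every <tr>, its cell texts and the first link found in it."""

    def __init__(self):
        super().__init__()
        self.rows = []         # one dict per finished <tr>
        self._row = None       # row currently being parsed
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # Start every row with the placeholder; overwrite it if a link shows up.
            self._row = {"cells": [], "pdf": "No PDF Link"}
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row["cells"].append("")
        elif tag == "a" and self._row is not None:
            href = dict(attrs).get("href")
            if href and self._row["pdf"] == "No PDF Link":
                self._row["pdf"] = href

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row["cells"][-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample markup, not copied from the real page.
SAMPLE_HTML = """
<table>
  <tr><td>Example Co Ltd - 500001</td><td><a href="https://example.com/a.pdf">PDF</a></td></tr>
  <tr><td>Another Co Ltd - 500002</td><td></td></tr>
</table>
"""

parser = RowLinkParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows

for row in rows:
    print(row["cells"][0], "|", row["pdf"])
```

Starting each row with the placeholder and overwriting it only when an `href` appears means every row yields exactly one PDF value, which keeps that column the same length as the others.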

I had a rough time getting this to work, but I think it works fine now. It's pretty inefficient, though. Here is the code:

from selenium import webdriver
import time 
import pandas as pd
from selenium.common.exceptions import NoSuchElementException

mydriver = webdriver.Chrome('C:/Program Files/chromedriver.exe')
mydriver.get("https://www.bseindia.com/corporates/ann.aspx?expandable=0")
time.sleep(5)  # wait 5 seconds for the DOM to finish loading

trs = mydriver.find_elements_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table/tbody/tr')
del trs[0]

names = []
r_time = []
d_time = []
t_taken = []
desc = []
pdfs = []
codes = []

i = 0
while i < len(trs):
    names.append(trs[i].text)

    l = trs[i].text.split()
    for item in l:
        try:
            code = int(item)
            if code > 100000:
                codes.append(code)
        except ValueError:
            pass  # token is not an integer, so it cannot be a company code

    link = trs[i].find_elements_by_tag_name('td')
    pdf_count = 2
    while pdf_count < len(link):
        try:
            pdf = link[pdf_count].find_element_by_tag_name('a')
            pdfs.append(pdf.get_attribute('href'))
        except NoSuchElementException:
            pdfs.append("No PDF")
        pdf_count = pdf_count + 4

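    # Use a separate name here: assigning to `time` would shadow the imported time module.
    time_parts = trs[i + 1].text.split()
    if len(time_parts) == 5:
        # Only the disseminated time is present in this row.
        r_time.append("No Time Given")
        d_time.append(time_parts[3] + " " + time_parts[4])
        t_taken.append("No Time Given")
    else:
        r_time.append(time_parts[3] + " " + time_parts[4])
        d_time.append(time_parts[8] + " " + time_parts[9])
        t_taken.append(time_parts[12])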

    desc.append(trs[i+2].text)

    i = i + 4

df = pd.DataFrame.from_dict({
    'Name': names,
    'Description': desc,
    'PDF Link': pdfs,
    'Company Code': codes,
    'Received Time': r_time,
    'Disseminated Time': d_time,
    'Time Taken': t_taken,
})
df.to_excel('corporate.xlsx', header=True, index=False)  # write the data to an Excel file

I have also added another item that was asked for: the company code, which goes in its own column. That's the result I get.
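
The loop above steps through the rows four at a time and picks values out of the split text by position. As a rough alternative, the same grouping can be sketched on plain strings, one record per announcement, so that every column ends up with the same length. The row layout assumed here (name row, time row, description row, one extra row) and the wording of the time row are inferred from the indices used in the code above, not checked against the live page. `parse_times`, `first_company_code`, `build_records` and the sample rows are hypothetical, and the PDF link is left out because it needs the row element rather than its text.

```python
import re

MISSING = "No Time Given"

# Assumed wording of the time row, inferred from the token positions used above.
TIME_RE = re.compile(
    r"Received Time\s+(?P<received>\S+\s+\S+).*?"
    r"Disseminated Time\s+(?P<disseminated>\S+\s+\S+).*?"
    r"Time Taken\s+(?P<taken>\S+)",
    re.DOTALL,
)
DISSEMINATED_ONLY_RE = re.compile(r"Disseminated Time\s+(?P<disseminated>\S+\s+\S+)")


def parse_times(text):
    """Return (received, disseminated, time_taken), with placeholders for gaps."""
    full = TIME_RE.search(text)
    if full:
        return full.group("received"), full.group("disseminated"), full.group("taken")
    partial = DISSEMINATED_ONLY_RE.search(text)
    if partial:
        return MISSING, partial.group("disseminated"), MISSING
    return MISSING, MISSING, MISSING


def first_company_code(text):
    """Return the first all-digit token greater than 100000, or None."""
    for token in text.split():
        if token.isdigit() and int(token) > 100000:
            return int(token)
    return None


def build_records(row_texts, rows_per_item=4):
    """Group row texts into one dict per announcement; skip incomplete groups."""
    records = []
    for start in range(0, len(row_texts) - 2, rows_per_item):
        name_row, time_row, desc_row = row_texts[start:start + 3]
        received, disseminated, taken = parse_times(time_row)
        records.append({
            "Name": name_row,
            "Company Code": first_company_code(name_row),
            "Received Time": received,
            "Disseminated Time": disseminated,
            "Time Taken": taken,
            "Description": desc_row,
        })
    return records


# Hypothetical row texts, standing in for [tr.text for tr in trs].
sample_rows = [
    "Example Co Ltd - 500001 - Board Meeting",
    "Exchange Received Time 01-01-2019 10:00:00 Exchange Disseminated Time "
    "01-01-2019 10:00:05 Time Taken 00:00:05",
    "Outcome of board meeting",
    "",
    "Another Co Ltd - 500002 - Update",
    "Exchange Disseminated Time 01-01-2019 11:00:00",
    "General update",
    "",
]

records = build_records(sample_rows)
for record in records:
    print(record)
```

A list of dicts like `records` can be passed straight to `pd.DataFrame(records)`, which avoids keeping seven parallel lists in sync.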
