
How to extract a date using Selenium WebDriver

I have been trying for hours to sort this out, but without success.

Here is my script, which uses Selenium WebDriver in Python to extract the title, date, and link of each item. I can extract the title and the link, but I am stuck on extracting the date. Could someone please help me with this? Any response is much appreciated.

import selenium.webdriver
import pandas as pd

frame=[]

url = "https://www.oric.gov.au/publications/media-releases"

driver = selenium.webdriver.Chrome("C:/Users/[Computer_Name]/Downloads/chromedriver.exe")
driver.get(url)

all_div = driver.find_elements_by_xpath('//div[contains(@class, "ui-accordion-content")]')

for div in all_div:
    all_items = div.find_elements_by_tag_name("a")

    for item in all_items:
        title = item.get_attribute('textContent')
        link = item.get_attribute('href')
        date = None  # stuck here: how do I get the date that belongs to this item?

        frame.append({
            'title': title,
            'date': date,
            'link': link,
        })

dfs = pd.DataFrame(frame)
dfs.to_csv('myscraper.csv',index=False,encoding='utf-8-sig')

Here is the html I am interested in:

<div id="ui-accordion-1-panel-0" ...>
      
  <div class="views-field views-field-title">        
    <span class="field-content">
      <a href="/publications/media-release/ngadju-corporation-emerges-special-administration-stronger">
        Ngadju corporation emerges from special administration stronger
      </a>
    </span> 
  </div>  
  <div class="views-field views-field-field-document-media-release-no"> 
    <div class="field-content"><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2020-07-31T00:00:00+10:00">
    31 July 2020
    </span> (MR2021-06)</div>  
  </div>  
</div>
      

...

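I'd get all the rows first, then read the title, link, and date from within each row, so that every date stays paired with its own link.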

from pprint import pprint

import selenium.webdriver

frame = []

url = "https://www.oric.gov.au/publications/media-releases"

driver = selenium.webdriver.Chrome()
driver.get(url)

divs = driver.find_elements_by_css_selector('div.ui-accordion-content')
for div in divs:
    rows = div.find_elements_by_css_selector('div.views-row')
    for row in rows:
        item = row.find_element_by_tag_name('a')
        title = item.get_attribute('textContent')
        link = item.get_attribute('href')
        date = row.find_element_by_css_selector(
            'span.date-display-single').get_attribute('textContent')
        frame.append({
            'title': title,
            'date': date,
            'link': link,
        })

driver.quit()

pprint(frame)
print(len(frame))
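
If the date is needed as a real date rather than a string, the span in the question's HTML also carries a machine-readable content attribute (content="2020-07-31T00:00:00+10:00"), which could be read with get_attribute('content'). Below is a minimal sketch of turning either that value or the visible text into a date. The helper name parse_release_date is hypothetical, and it assumes every date on the page follows the same two formats as the sample HTML.

```python
from datetime import datetime


def parse_release_date(content_value=None, visible_text=None):
    """Hypothetical helper: return a datetime.date, or None if nothing usable is given.

    content_value: the span's `content` attribute, e.g. "2020-07-31T00:00:00+10:00"
    visible_text:  the span's text, e.g. "31 July 2020" (may have surrounding whitespace)
    """
    if content_value and content_value.strip():
        # ISO 8601 with a UTC offset; fromisoformat handles this on Python 3.7+.
        return datetime.fromisoformat(content_value.strip()).date()
    if visible_text and visible_text.strip():
        # Assumes English month names ("%B" depends on the current locale).
        return datetime.strptime(visible_text.strip(), "%d %B %Y").date()
    return None


# Values copied from the HTML in the question.
print(parse_release_date(content_value="2020-07-31T00:00:00+10:00"))
print(parse_release_date(visible_text="\n    31 July 2020\n    "))
```

Preferring the content attribute avoids depending on the page's display format; the text fallback is there only in case the attribute is missing.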

OK, just search for the <span> whose property attribute is dc:date, store it in a WebElement (say, dateElement), and read dateElement.text. That gives you the date as a string.
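
The lookup described here could be sketched as follows. The function name extract_date is hypothetical, and row is assumed to be the WebElement that contains a single item (for example a div.views-row, as in the other answer). One caveat: .text returns only text that is currently visible, so if the accordion panel is collapsed it may come back empty; the sketch falls back to textContent in that case. The locator strategy is passed as the plain string "css selector", which is the value of By.CSS_SELECTOR, so the snippet needs no imports.

```python
def extract_date(row):
    """Hypothetical helper: return the date string found inside `row`.

    `row` is assumed to be a Selenium WebElement wrapping one item.
    """
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    date_element = row.find_element("css selector", 'span[property="dc:date"]')

    # .text only covers rendered, visible text.
    text = date_element.text.strip()
    if not text:
        # Fall back to the raw DOM text for hidden elements (e.g. collapsed panels).
        text = (date_element.get_attribute("textContent") or "").strip()
    return text
```

Inside the loop over rows this would be used as date = extract_date(row). If a row has no such span, find_element raises NoSuchElementException, which the sketch leaves to the caller to handle.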
