
Scraping all tooltips on a website with Selenium (Python)?

I am currently attempting to scrape this website: https://schedule.townsville-port.com.au/

I would like to scrape the text in all the individual tooltips.

Here is what the HTML for a typical element I have to hover over looks like:

<div event_id="55591" class="dhx_cal_event_line past_event" style="position:absolute; top:2px; height: 42px; left:1px; width:750px;"><div> 

Here is what the HTML for a typical tooltip looks like:

<div class="dhtmlXTooltip tooltip" style="visibility: visible; left: 803px; bottom:74px;

I have tried various combinations, such as attempting to scrape the tooltips directly and attempting to scrape the HTML after hovering over the elements I need to hover over.

tool_tips=driver.find_elements_by_class_name("dhx_cal_event_line past_event")

tool_tips=driver.find_elements_by_xpath("//div[@class=dhx_cal_event_line past_event]")

tool_tips=driver.find_element_by_css_selector("dhx_cal_event_line past_event")

I have also attempted the same code with "dhtmlXTooltip tooltip" instead of "dhx_cal_event_line past_event".

In particular, I really don't understand why

tool_tips=driver.find_elements_by_class_name("dhx_cal_event_line past_event")

doesn't work.
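A minimal sketch of the hover-and-read approach, assuming the compound class names are written as CSS selectors (find_elements_by_class_name only accepts a single class name, which may be why the calls above return nothing); the selectors are guessed from the HTML shown above:

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://schedule.townsville-port.com.au/")

# Compound classes must be joined with dots in a CSS selector;
# find_elements_by_class_name only accepts one class name.
events = driver.find_elements_by_css_selector("div.dhx_cal_event_line.past_event")

tooltip_texts = []
for event in events:
    # Hover over the event so the scheduler renders its tooltip.
    ActionChains(driver).move_to_element(event).perform()
    # Wait until the tooltip div is actually visible, then read its text.
    tooltip = WebDriverWait(driver, 5).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, "div.dhtmlXTooltip.tooltip"))
    )
    tooltip_texts.append(tooltip.text)

print(tooltip_texts)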

Can BeautifulSoup be used to tackle this, since the HTML is dynamic and changing?

If you open the Network tab in Chrome DevTools and filter by XHR, you can see that the website makes a request to http://schedule.townsville-port.com.au/spotschedule.php.

from bs4 import BeautifulSoup
import requests

url = 'http://schedule.townsville-port.com.au/spotschedule.php'
# The endpoint's certificate fails verification, hence verify=False (see below).
r = requests.get(url, verify=False)
# The response is XML, so parse it with the 'xml' parser (requires lxml).
soup = BeautifulSoup(r.text, 'xml')

transports = {}
events = soup.find_all('event')

# Each <event> element holds one vessel visit; collect its child tags into a dict.
for e in events:
    transport_id = e['id']
    transport = {child.name: child.text for child in e.children}
    transports[transport_id] = transport

import pprint
pprint.pprint(transports)

Output:

{'48165': {'IMO': '8201480',
           'app_rec': 'Approved',
           'cargo': 'Passenger Vessel (Import)',
           'details': 'Inchcape Shipping Services Pty Limited',
           'duration': '8',
           'end_date': '2018-02-17 14:03:00.000',
           'sectionID': '10',
           'start_date': '2018-02-17 06:44:00.000',
           'text': 'ARTANIA',
           'visit_id': '19109'},
 ...
}

The only way I found to get rid of the SSLError was to disable certificate verification with verify=False; you can read more about it here.
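As a side note, verify=False makes urllib3 emit an InsecureRequestWarning on every request; if that noise bothers you, the warning can be suppressed explicitly. A minimal sketch (not part of the original answer):

import urllib3
import requests

# verify=False skips certificate validation, so urllib3 warns on every request.
# Suppress only that warning category rather than silencing all warnings.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

r = requests.get('http://schedule.townsville-port.com.au/spotschedule.php', verify=False)
print(r.status_code)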

Notice that start_date and end_date are UTC times, so you can either specify the timeshift query param:

import time

# Offset west of UTC in minutes (same sign convention as JavaScript's getTimezoneOffset()).
utc_offset = -time.localtime().tm_gmtoff // 60  # in minutes
url = f'http://schedule.townsville-port.com.au/spotschedule.php?timeshift={utc_offset}'

or convert the dates and store them as datetime objects (you can read about converting time from UTC to your local timezone here).
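A minimal sketch of that second option, assuming the date strings keep the '%Y-%m-%d %H:%M:%S.%f' format shown in the output above and that transports is the dict built earlier:

from datetime import datetime, timezone

def to_local(utc_string):
    # Parse the UTC timestamp string from the feed and convert it to local time.
    utc_dt = datetime.strptime(utc_string, '%Y-%m-%d %H:%M:%S.%f')
    return utc_dt.replace(tzinfo=timezone.utc).astimezone()  # astimezone() with no argument uses the local timezone

for transport in transports.values():
    transport['start_date'] = to_local(transport['start_date'])
    transport['end_date'] = to_local(transport['end_date'])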
