I am currently attempting to scrape this website: https://schedule.townsville-port.com.au/
I would like to scrape the text in all the individual tooltips.
Here is what the HTML for a typical element I have to hover over looks like:
<div event_id="55591" class="dhx_cal_event_line past_event" style="position:absolute; top:2px; height: 42px; left:1px; width:750px;"></div>
Here is what the typical HTML for the tooltip looks like:
<div class="dhtmlXTooltip tooltip" style="visibility: visible; left: 803px; bottom: 74px;">
I have tried various combinations, such as attempting to select the tooltips directly and also triggering them by hovering over the event elements and then scraping the resulting HTML.
tool_tips=driver.find_elements_by_class_name("dhx_cal_event_line past_event")
tool_tips=driver.find_elements_by_xpath("//div[@class=dhx_cal_event_line past_event]")
tool_tips=driver.find_element_by_css_selector("dhx_cal_event_line past_event")
I have also attempted the same code with "dhtmlXTooltip tooltip" in place of "dhx_cal_event_line past_event".
I really don't understand why
tool_tips=driver.find_elements_by_class_name("dhx_cal_event_line past_event")
doesn't work.
Can BeautifulSoup be used to tackle this instead, given that the HTML is dynamic?
If you open the Network tab in Chrome DevTools and filter by XHR, you can see that the website makes a request to http://schedule.townsville-port.com.au/spotschedule.php.
from bs4 import BeautifulSoup
import pprint
import requests

url = 'http://schedule.townsville-port.com.au/spotschedule.php'
r = requests.get(url, verify=False)
soup = BeautifulSoup(r.text, 'xml')  # the 'xml' parser requires lxml to be installed

# Build a dict of events keyed by their id attribute.
transports = {}
events = soup.find_all('event')
for e in events:
    transport_id = e['id']
    transport = {child.name: child.text for child in e.children}
    transports[transport_id] = transport

pprint.pprint(transports)
Output:
{'48165': {'IMO': '8201480',
'app_rec': 'Approved',
'cargo': 'Passenger Vessel (Import)',
'details': 'Inchcape Shipping Services Pty Limited',
'duration': '8',
'end_date': '2018-02-17 14:03:00.000',
'sectionID': '10',
'start_date': '2018-02-17 06:44:00.000',
'text': 'ARTANIA',
'visit_id': '19109'},
...
}
The only way I found to get rid of the SSLError was to disable certificate verification with verify=False.
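Note that with verify=False, requests emits an InsecureRequestWarning on every call. If that noise is a problem, it can be silenced explicitly via urllib3 (the HTTP library requests uses under the hood); a minimal sketch:

```python
import urllib3

# verify=False triggers an InsecureRequestWarning per request;
# suppress just that warning category so logs stay readable.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
```
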
Notice that start_date and end_date are UTC times, so you can either specify the timeshift query param:
import time

utc_offset = -time.localtime().tm_gmtoff // 60  # local UTC offset in minutes
url = f'http://schedule.townsville-port.com.au/spotschedule.php?timeshift={utc_offset}'
or convert the dates yourself and store them as datetime objects, translating from UTC to your local timezone.
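That conversion can be sketched with the standard library alone; the to_local helper name and the timestamp format (taken from the output above) are my own assumptions:

```python
from datetime import datetime, timezone

def to_local(utc_str):
    """Parse a UTC timestamp like '2018-02-17 06:44:00.000' into an
    aware datetime in the machine's local timezone (hypothetical helper)."""
    dt = datetime.strptime(utc_str, '%Y-%m-%d %H:%M:%S.%f')
    return dt.replace(tzinfo=timezone.utc).astimezone()

# Local wall-clock time for the ARTANIA start_date from the output above.
print(to_local('2018-02-17 06:44:00.000').isoformat())
```
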