
Scraping all tooltips in a website with Selenium (Python)?

I am currently attempting to scrape this website: https://schedule.townsville-port.com.au/

I would like to scrape the text in all the individual tooltips.

Here is what the HTML for a typical element I have to hover over looks like:

<div event_id="55591" class="dhx_cal_event_line past_event" style="position:absolute; top:2px; height: 42px; left:1px; width:750px;"></div> 

Here is what the typical HTML for the tooltip looks like:

<div class="dhtmlXTooltip tooltip" style="visibility: visible; left: 803px; bottom:74px;

I have tried various combinations, such as attempting to scrape the tooltips directly, and also hovering over the elements and scraping the resulting HTML.

tool_tips=driver.find_elements_by_class_name("dhx_cal_event_line past_event")

tool_tips=driver.find_elements_by_xpath("//div[@class=dhx_cal_event_line past_event]")

tool_tips=driver.find_element_by_css_selector("dhx_cal_event_line past_event")

I have also attempted the same code with "dhtmlXTooltip tooltip" instead of "dhx_cal_event_line past_event".

I really don't understand why.

tool_tips=driver.find_elements_by_class_name("dhx_cal_event_line past_event")

Doesn't work.
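For what it's worth, `find_elements_by_class_name` accepts a single class name only, so a compound value like "dhx_cal_event_line past_event" matches nothing; a CSS selector or XPath is needed instead. A sketch of corrected locator strings (running them still requires a live driver, so only the strings are built here):

```python
# Compound class names: join with dots in a CSS selector ...
css_selector = "div.dhx_cal_event_line.past_event"

# ... or test each class separately in XPath
xpath = ("//div[contains(@class, 'dhx_cal_event_line') and "
         "contains(@class, 'past_event')]")

# With a live driver, hovering to trigger the tooltip would then look like
# (not run here):
# from selenium.webdriver.common.action_chains import ActionChains
# events = driver.find_elements_by_css_selector(css_selector)
# ActionChains(driver).move_to_element(events[0]).perform()
print(css_selector)
```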

Can BeautifulSoup be used to tackle this, since the HTML is dynamic and changing?

If you open the Network tab in Chrome DevTools and filter by XHR, you can see that the website makes a request to http://schedule.townsville-port.com.au/spotschedule.php.

from bs4 import BeautifulSoup
import pprint
import requests

url = 'http://schedule.townsville-port.com.au/spotschedule.php'
r = requests.get(url, verify=False)
soup = BeautifulSoup(r.text, 'xml')

# Each <event> element holds the data shown in one tooltip
transports = {}
events = soup.find_all('event')

for e in events:
    transport_id = e['id']
    transport = {child.name: child.text for child in e.children}
    transports[transport_id] = transport

pprint.pprint(transports)
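As a sanity check, the same per-event extraction can be run offline against a hand-made sample of the feed (the element layout here is inferred from the parsing code and output above, not from the live endpoint); this sketch uses the standard library's ElementTree instead of BeautifulSoup so it needs no extra dependencies:

```python
import xml.etree.ElementTree as ET

# Minimal hand-made sample mimicking the assumed spotschedule.php layout
sample = """<data>
  <event id="48165">
    <start_date>2018-02-17 06:44:00.000</start_date>
    <end_date>2018-02-17 14:03:00.000</end_date>
    <text>ARTANIA</text>
  </event>
</data>"""

root = ET.fromstring(sample)
# Same shape as the BeautifulSoup version: {event id: {field: value}}
transports = {e.get('id'): {child.tag: child.text for child in e}
              for e in root.iter('event')}
print(transports['48165']['text'])  # ARTANIA
```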

Output:

{'48165': {'IMO': '8201480',
           'app_rec': 'Approved',
           'cargo': 'Passenger Vessel (Import)',
           'details': 'Inchcape Shipping Services Pty Limited',
           'duration': '8',
           'end_date': '2018-02-17 14:03:00.000',
           'sectionID': '10',
           'start_date': '2018-02-17 06:44:00.000',
           'text': 'ARTANIA',
           'visit_id': '19109'},
 ...
}

The only way I found to get rid of the SSLError was to disable certificate verification with verify=False; you can read more about it here.

Notice that start_date and end_date are UTC times, so you can either specify the timeshift query parameter:

import time

utc_offset = -time.localtime().tm_gmtoff // 60  # in minutes    
url = f'http://schedule.townsville-port.com.au/spotschedule.php?timeshift={utc_offset}'

or convert the dates and store them as datetime objects (you can read about converting time from UTC to your local timezone here).
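A minimal sketch of that second option, assuming the `%Y-%m-%d %H:%M:%S.%f` timestamp format seen in the output above:

```python
from datetime import datetime, timezone

def parse_utc(s):
    # Parse the feed's timestamp string and tag it as UTC
    return datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f').replace(tzinfo=timezone.utc)

start = parse_utc('2018-02-17 06:44:00.000')
local_start = start.astimezone()  # shift into the machine's local timezone
print(start.isoformat())  # 2018-02-17T06:44:00+00:00
```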
