
How to read data from a dynamic website faster in Selenium

I'm scraping a few dynamic websites (football live bets). There's no API, so I'm reading all of them with Selenium. I run an infinite loop and find the elements on every iteration.

while True:
    # Re-query all game containers from the browser on every iteration.
    elements = self.driver.find_elements_by_xpath(games_path)
    for e in elements:
        match = Match()
        # Betting is open if the game has no 'no_betting_odds' descendant.
        match.betting_opened = len(e.find_elements_by_class_name('no_betting_odds')) == 0

The problem is it's one hundred times slower than I need it to be.

What's the alternative to this? Is there another library, or a way to speed it up with Selenium?

One of the websites I'm scraping: https://www.betcris.pl/zaklady-live#/Soccer

The piece of code you posted has a while True loop without a break; that is an implementation of an infinite loop. From such a short snippet I can't tell whether this is the root cause of your "infinite loop" issue, but it may be, so check whether you have any break statements inside your while loop.
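For illustration, here is a minimal runnable sketch of how such a loop terminates; the stop condition below is just a placeholder, not taken from your code:

# A while True loop only ends via break (or a return / an exception).
attempts = 0
while True:
    attempts += 1
    if attempts >= 3:  # placeholder stop condition
        break
print(attempts)  # prints 3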

As for the other part of your question: I am not sure how you measure the performance of an infinite loop, but there is a way to speed up parsing pages with Selenium: not using Selenium. Grab a snapshot of the page and use that for evaluating states, values and so on.

import lxml.html

# Grab the rendered HTML from the browser once and parse it locally with lxml.
page_snapshot = lxml.html.document_fromstring(self.driver.page_source)
# Subsequent queries run against the in-memory tree, not the browser.
games = page_snapshot.xpath(games_path)

This approach is about two orders of magnitude faster than querying via the Selenium API. Grab the page once, parse the hell out of it real quick, and grab the page again later if you want to. If you just want to read data, you don't need WebElements at all, only the tree of data. To interact with elements you will of course need the WebElement via Selenium, but to get values and states, a snapshot may be sufficient.
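For example, here is a minimal sketch of how the loop body from the question could look on top of such a snapshot; games_path and the 'no_betting_odds' class are taken from the question, while the function shape and the returned list of booleans are my own assumptions:

import lxml.html

def read_betting_flags(driver, games_path):
    # One round-trip to the browser, then all parsing happens locally in lxml.
    page_snapshot = lxml.html.document_fromstring(driver.page_source)
    betting_opened = []
    for game in page_snapshot.xpath(games_path):
        # Betting is open if the game node has no 'no_betting_odds' descendant.
        closed_markers = game.xpath('.//*[contains(@class, "no_betting_odds")]')
        betting_opened.append(len(closed_markers) == 0)
    return betting_opened

Call something like this once per iteration of your loop instead of issuing one Selenium query per game.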

Or, what you could do with Selenium only: add the 'no_betting_odds' condition to the games_path XPath. It seems to me that you want to grab those elements which do not have a 'no_betting_odds' class. Then just add './/*[not(contains(@class, "no_betting_odds"))]' to the games_path (which you did not share, so I can't update it).
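For illustration only (the real games_path was not shared, so the container path below is a made-up placeholder), one way to express "game containers without a 'no_betting_odds' element" is a predicate on the container, a slight variation of the expression above:

# Hypothetical container path; replace with your actual games_path.
games_path = '//div[contains(@class, "game")]'

# Keep only game containers that do NOT contain a 'no_betting_odds' element.
games_with_odds_path = games_path + '[not(.//*[contains(@class, "no_betting_odds")])]'

elements = self.driver.find_elements_by_xpath(games_with_odds_path)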
