Scraping content with Python and Selenium

Question

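I want to extract all league names (e.g. England Premier League, Scotland Premiership, etc.) from this website: https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1

Using the inspector tool in Chrome / Firefox, I can see that they are located in elements like this: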

<span>England Premier League</span>

So I tried this:

from lxml import html

from selenium import webdriver

session = webdriver.Firefox()
url = 'https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1'
session.get(url)
tree = html.fromstring(session.page_source)
leagues = tree.xpath('//span/text()')
print(leagues)

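Unfortunately, this did not return the expected result :-(

It looks to me as if the site uses several frames, and I am extracting content from the wrong one.

Can anyone help me here or point me in the right direction? Alternatively, if someone knows how to extract the information through their API, that would obviously be the better solution.

Any help is greatly appreciated. Thank you!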

Answer 1

Hopefully you are looking for something like this:

from selenium import webdriver
import bs4
import time

driver = webdriver.Chrome()
url = 'https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1'

driver.get(url)
driver.maximize_window()
# Sleep so that the page's JavaScript has time to populate the data
time.sleep(10)
page_source = driver.page_source

soup = bs4.BeautifulSoup(page_source, "html.parser")

# Print the text of every <span> inside each div with class "eventWrapper"
for data in soup.find_all('div', {'class': 'eventWrapper'}):
    for res in data.find_all('span'):
        print(res.text)

It prints the following data:

Wednesday's Matches
International List
Elite Euro List
UK List
Australia List
Club Friendly List
England Premier League
England EFL Cup
England Championship
England League 1
England League 2
England National League
England National League North
England National League South
Scotland Premiership
Scotland League Cup
Scotland Championship
Scotland League One
Scotland League Two
Northern Ireland Reserve League
Scotland Development League East
Wales Premier League
Wales Cymru Alliance
Asia - World Cup Qualifying
UEFA Champions League
UEFA Europa League
Wednesday's Matches
International List
Elite Euro List
UK List
Australia List
Club Friendly List
England Premier League
England EFL Cup
England Championship
England League 1
England League 2
England National League
England National League North
England National League South
Scotland Premiership
Scotland League Cup
Scotland Championship
Scotland League One
Scotland League Two
Northern Ireland Reserve League
Scotland Development League East
Wales Premier League
Wales Cymru Alliance
Asia - World Cup Qualifying
UEFA Champions League
UEFA Europa League

The only problem is that it prints the result set twice.
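One simple way to work around the repeated output is to drop duplicates after scraping while keeping the original order. The sketch below is only an illustration: `unique_in_order` is a hypothetical helper name, and the `scraped` list is a small made-up sample standing in for the span texts collected above. It does not explain why the page yields each name twice.

```python
def unique_in_order(names):
    """Return names with later duplicates dropped, keeping first-seen order."""
    # dict keys are unique and keep insertion order (guaranteed since Python 3.7)
    return list(dict.fromkeys(names))


# Hypothetical sample standing in for the scraped span texts
scraped = [
    "England Premier League",
    "Scotland Premiership",
    "England Premier League",
    "Scotland Premiership",
]

leagues = unique_in_order(scraped)
print(leagues)
```

In the loop above, the same effect could be had by appending each `res.text` to a list and passing that list through the helper before printing.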

Answer 2

The required content is missing from the initial page source. It is loaded dynamically from https://mobile.bet365.com/V6/sport/splash/splash.aspx?zone=0&isocode=RO&tzi=4&key=1&gn=0&cid=1&lng=1&ctg=1&ct=156&clt=8881&ot=2
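A quick look at the two URLs with the standard library helps show why the first response cannot contain the leagues. Everything after `#` is a fragment, which the browser keeps to itself and never sends to the server, so the server only sees a request for `/`. The reading that the page's JavaScript then uses those fragment values to request the second URL is an assumption consistent with this answer, not something verified here. Variable names are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

page_url = "https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1"
data_url = (
    "https://mobile.bet365.com/V6/sport/splash/splash.aspx"
    "?zone=0&isocode=RO&tzi=4&key=1&gn=0&cid=1&lng=1&ctg=1&ct=156&clt=8881&ot=2"
)

page = urlsplit(page_url)
# The fragment is client-side only; it is not part of the HTTP request
fragment_params = dict(item.split("=", 1) for item in page.fragment.split(";"))

data = urlsplit(data_url)
# parse_qs returns a list per key; each key occurs once here, so take the first value
query_params = {key: values[0] for key, values in parse_qs(data.query).items()}

print("page path:", page.path, "fragment:", fragment_params)
print("data path:", data.path, "query:", query_params)
```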

To get this dynamically loaded content, you can use an explicit wait, as follows:

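from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

session = webdriver.Firefox()
url = 'https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1'
session.get(url)
# Wait up to 10 seconds for the element with id "Splash" to appear
WebDriverWait(session, 10).until(EC.presence_of_element_located((By.ID, 'Splash')))

# Expand every collapsed section so that its leagues are rendered.
# find_elements(By.XPATH, ...) replaces find_elements_by_xpath,
# which newer Selenium 4 releases no longer provide.
for collapsed in session.find_elements(By.XPATH, '//h3[contains(@class, "collapsed")]'):
    # Reading this property scrolls the element into view
    collapsed.location_once_scrolled_into_view
    collapsed.click()

# Print the text of every <span> inside the eventWrapper divs
for event in session.find_elements(By.XPATH, '//div[contains(@class, "eventWrapper")]//span'):
    print(event.text)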
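For readers unfamiliar with explicit waits, the idea is to poll a condition until it holds or a timeout expires, instead of sleeping for a fixed time. The sketch below illustrates that idea in plain Python. It is not Selenium's implementation, and `wait_until`, `content_ready`, and `attempts` are hypothetical names introduced only for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical stand-in for a page whose content appears on the third check
attempts = {"count": 0}


def content_ready():
    attempts["count"] += 1
    return "Splash" if attempts["count"] >= 3 else None


found = wait_until(content_ready, timeout=2.0, poll_interval=0.01)
print(found, attempts["count"])
```

Compared with a fixed `time.sleep(10)`, this returns as soon as the condition is met and fails loudly when it never is.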

Answer 1
2 Accepted 2017-09-20 11:02:50

Answer 2
1 2017-09-20 11:19:03
