Scraping JavaScript-rendered web page elements with Selenium

Question

I am looking for the data that feeds the SVG elements on this page:

https://www.beinsports.com/au/livescores/match-center/2019/23/1074885

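The page appears to be rendered by JavaScript, so the usual BeautifulSoup approach in Python does not work. I refreshed the page with the XHR filter of the Network tab open in the inspector, but that did not show the page storing its data as JSON either. However, when I refresh with the JS filter of the Network tab selected, I see F24_8.js, whose preview shows the content I want to capture; it is what feeds the SVG elements: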

Is there a way to run a script from Selenium, for example, to simulate the JavaScript rendering and retrieve the back-end data in question?

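As requested in the comments, I have included below a script that worked for a similar page on a domain that is no longer supported. In that case, because the page exposed an XML object, executing a script to serialize the XML was more straightforward; the page did not display row-level detail, but the script returned the row-level data used by the rendering tool: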

from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
import selenium
from selenium import webdriver
import re
import math
import time

games=[]

browser = webdriver.PhantomJS()
browser.get("http://www.squawka.com/match-results")
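# Constructing WebDriverWait alone does not wait; .until() blocks until the element exists
WebDriverWait(browser,10).until(lambda d: d.find_element_by_id("league-filter-list"))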

mySelect = Select(browser.find_element_by_id("league-filter-list"))
mySelect.select_by_visible_text("German Bundesliga")

seasons=['Season 2012/2013','Season 2013/2014','Season 2014/2015','Season 2015/2016','Season 2016/2017','Season 2017/2018']
for season in seasons:
    nextSelect = Select(browser.find_element_by_id("league-season-list"))
    nextSelect.select_by_visible_text(season)

    source=browser.page_source
    soup=BeautifulSoup(source,'html.parser')
    games.extend([a.get('href') for a in soup.find_all('a',attrs={'href':re.compile('matches')})])

    pages=math.ceil(float(soup.find('span',{'class':'displaying-num'}).get_text().split('of')[-1].strip())/30)
    for page in range(2,int(pages)+1):
        browser.find_element_by_xpath('//a[contains(@href,"pg='+str(page)+'")]').click()
        source=browser.page_source
        soup=BeautifulSoup(source,'html.parser')
        games.extend([a.get('href') for a in soup.find_all('a',attrs={'href':re.compile('matches')})])

    print('---------\n'+season+' Games Appended')

import pandas as pd
import numpy as np
import lxml.etree as etree

frames=[]
count=0
for game in g2:  # g2 is not defined in this excerpt; presumably a subset of games

    try:
        url = game

        browser = webdriver.PhantomJS()
        browser.get(url)

        time.sleep(10)

        page = browser.execute_script('return new XMLSerializer().serializeToString(squawkaDp.xml);')
        root = etree.XML(page.encode('utf-8'))

        #events
        gm=pd.DataFrame()
        for f in root.iter('filters'):
            for a in f:
                for event in a.iter('event'):

                    d=event.attrib
                    records=dict((x,[y]) for x,y in d.items())
                    new_records=dict((x.tag,[x.text]) for x in event)

                    r=pd.DataFrame(records)
                    nr=pd.DataFrame(new_records)
                    j=r.join(nr)
                    j['category']=a.tag

                    # DataFrame.append was removed in pandas 2.0; concat is the replacement
                    gm=pd.concat([gm,j])

I recognize that the script is incomplete, but the remaining details are not necessary for the question at hand.
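
The event-flattening loop in the script above can be tried offline, without a browser. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of lxml and plain dicts instead of one DataFrame per event. The sample XML, its tag names other than filters and event, and the function name flatten_events are invented for illustration; they are not real Squawka data.

```python
import xml.etree.ElementTree as ET

# Invented sample that mimics the <filters>/<category>/<event> nesting
# the loop above walks; it is not real Squawka data.
SAMPLE_XML = """
<root>
  <filters>
    <goals_attempts>
      <event player_id="10" mins="12"><type>goal</type></event>
      <event player_id="7" mins="40"><type>saved</type></event>
    </goals_attempts>
    <tackles>
      <event player_id="4" mins="55"><type>won</type></event>
    </tackles>
  </filters>
</root>
"""


def flatten_events(xml_text):
    """Return one dict per <event>: its attributes, child tags and category."""
    root = ET.fromstring(xml_text)
    rows = []
    for filters in root.iter("filters"):
        for category in filters:
            for event in category.iter("event"):
                row = dict(event.attrib)  # attributes such as player_id
                row.update({child.tag: child.text for child in event})  # child elements
                row["category"] = category.tag  # name of the enclosing filter
                rows.append(row)
    return rows


rows = flatten_events(SAMPLE_XML)
for row in rows:
    print(row)
```

Collecting plain dicts and calling pd.DataFrame(rows) once at the end avoids growing a DataFrame inside the loop.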

Answer 1

If you are using Selenium, you can run JavaScript on the page with driver.execute_script("your script here").

I have only used it successfully with very short scripts, such as arguments[0].click();, so I am not sure how well it works with longer scripts.

from selenium import webdriver

driver = webdriver.Chrome()

driver.execute_script("your script here")

You can also run a script against a WebElement by referring to it as arguments[0] in the JavaScript.

my_element = driver.find_element_by_xpath("//div[text()='someText']")

driver.execute_script("arguments[0].click();", my_element)

This passes my_element into the JS function as arguments[0].
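
As a rough illustration of a longer script, the sketch below passes a CSS selector in as arguments[0] and returns the attributes of every matching element, which is one possible way to read rendered SVG nodes back into Python. The helper name collect_attributes and the selector used in the note after the code are assumptions made for this example; nothing here is taken from the page in the question.

```python
import json

# JavaScript executed inside the page. It collects the attributes of every
# element matching the CSS selector passed in as arguments[0].
COLLECT_ATTRIBUTES_JS = """
var nodes = document.querySelectorAll(arguments[0]);
var out = [];
for (var i = 0; i < nodes.length; i++) {
    var attrs = {};
    for (var j = 0; j < nodes[i].attributes.length; j++) {
        var a = nodes[i].attributes[j];
        attrs[a.name] = a.value;
    }
    attrs.tag = nodes[i].tagName;
    out.push(attrs);
}
return JSON.stringify(out);
"""


def collect_attributes(driver, css_selector):
    """Hypothetical helper: return a list of attribute dicts, one per match.

    `driver` is any object with Selenium's execute_script(script, *args).
    """
    raw = driver.execute_script(COLLECT_ATTRIBUTES_JS, css_selector)
    return json.loads(raw)
```

With a live session this would be called as collect_attributes(driver, "svg circle"), where the selector is only a guess at how the page draws its events. Returning a JSON string keeps the conversion explicit; execute_script can also return plain arrays and objects directly.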
