Scraping JavaScript-rendered elements of a webpage in Selenium

I want to scrape the data that feeds the SVG elements on this page:

https://www.beinsports.com/au/livescores/match-center/2019/23/1074885

The page appears to be rendered with JavaScript, so the usual BeautifulSoup approach in Python does not work. I reloaded the page with the XHR filter of the browser's Inspect > Network tab open, and the page does not appear to load its data as JSON either. However, when I reload with the JS filter selected in the Network tab, I see F24_8.js, whose preview shows exactly the data I want to capture, the data feeding the SVG elements:

F24_8.js

Is there a way to run a script from Selenium, for example, to mimic that JavaScript rendering and retrieve the back-end data in question?

Per a request in the comments, I've included a script below that worked against a similar page that the domain no longer supports. In that case, serializing the XML was more straightforward because the page exposed the XML to scripts. That page did not display the row-level detail either, but the script returned the row-level data that fed the rendered tools:

from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
import selenium
from selenium import webdriver
import re
import math
import time

games=[]

browser = webdriver.PhantomJS()
browser.get("http://www.squawka.com/match-results")
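# WebDriverWait does nothing until .until() is called; wait for the league dropdown to exist
WebDriverWait(browser,10).until(lambda d: d.find_element_by_id("league-filter-list"))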

mySelect = Select(browser.find_element_by_id("league-filter-list"))
mySelect.select_by_visible_text("German Bundesliga")

seasons=['Season 2012/2013','Season 2013/2014','Season 2014/2015','Season 2015/2016','Season 2016/2017','Season 2017/2018']
for season in seasons:
    nextSelect = Select(browser.find_element_by_id("league-season-list"))
    nextSelect.select_by_visible_text(season)

    source=browser.page_source
    soup=BeautifulSoup(source,'html.parser')
    games.extend([a.get('href') for a in soup.find_all('a',attrs={'href':re.compile('matches')})])

    pages=math.ceil(float(soup.find('span',{'class':'displaying-num'}).get_text().split('of')[-1].strip())/30)
    for page in range(2,int(pages)+1):
        browser.find_element_by_xpath('//a[contains(@href,"pg='+str(page)+'")]').click()
        source=browser.page_source
        soup=BeautifulSoup(source,'html.parser')
        games.extend([a.get('href') for a in soup.find_all('a',attrs={'href':re.compile('matches')})])

    print('---------\n'+season+' Games Appended')

import pandas as pd
import numpy as np
import lxml.etree as etree

frames=[]
count=0
# NOTE: g2 is not defined in the posted excerpt; it presumably holds match URLs derived from games
for game in g2:

    try:
        url = game

        browser = webdriver.PhantomJS()
        browser.get(url)

        time.sleep(10)

        page = browser.execute_script('return new XMLSerializer().serializeToString(squawkaDp.xml);')
        root = etree.XML(page.encode('utf-8'))

        #events
        gm=pd.DataFrame()
        for f in root.iter('filters'):
            for a in f:
                for event in a.iter('event'):

                    d=event.attrib
                    records=dict((x,[y]) for x,y in d.items())
                    new_records=dict((x.tag,[x.text]) for x in event)

                    r=pd.DataFrame(records)
                    nr=pd.DataFrame(new_records)
                    j=r.join(nr)
                    j['category']=a.tag

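                    # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
                    gm=pd.concat([gm,j])

    except Exception as exc:
        # The posted excerpt ends inside the try block; this handler only makes the code parse
        print('Failed to process '+str(game)+': '+str(exc))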

I recognize that the script is incomplete, but the remaining details are not necessary for the question at hand.

If you use Selenium, you can call driver.execute_script("your script here") to run JavaScript on the page.

I have only used this, successfully, with very short scripts such as arguments[0].click(); so I'm not sure how it will work for longer scripts.

from selenium import webdriver

driver = webdriver.Chrome()

driver.execute_script("your script here")
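On the doubt about longer scripts: execute_script accepts a multi-statement string, and whatever the script's final return statement yields is converted to Python values (a JavaScript array of objects comes back as a list of dicts). The following is only a sketch of that idea. The helper name collect_svg_nodes and the default CSS selector "svg *" are assumptions for illustration, not something taken from the page in the question.

```python
# Multi-statement JavaScript. Extra arguments given to execute_script are
# available inside the script as arguments[0], arguments[1], ...
COLLECT_SVG_SCRIPT = """
var nodes = document.querySelectorAll(arguments[0]);
var out = [];
for (var i = 0; i < nodes.length; i++) {
    var attrs = {};
    for (var j = 0; j < nodes[i].attributes.length; j++) {
        var a = nodes[i].attributes[j];
        attrs[a.name] = a.value;
    }
    out.push({tag: nodes[i].tagName, attrs: attrs});
}
return out;
"""


def collect_svg_nodes(driver, css_selector="svg *"):
    """Return one dict per element matched by css_selector.

    driver is any object with a Selenium-style execute_script method.
    The selector is a hypothetical default; adjust it to the real page.
    """
    return driver.execute_script(COLLECT_SVG_SCRIPT, css_selector)
```

With a live session this would be called as collect_svg_nodes(driver) once the page has finished rendering. It reads what was drawn into the DOM, which may or may not carry all the detail of the underlying feed.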

You can also run scripts against WebElements by referring to arguments[0] in the JavaScript.

my_element = driver.find_element_by_xpath("//div[text()='someText']")

driver.execute_script("arguments[0].click();", my_element)

This passes my_element into the script as arguments[0].
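Applied to the question, one possible use of execute_script is to ask the browser which resources the page loaded and pick out the file seen in the Network tab. The sketch below assumes the browser exposes the standard Resource Timing API (window.performance.getEntriesByType). The helper name find_resource_urls is hypothetical, and "F24_" is simply the filename fragment mentioned in the question.

```python
# arguments[0] is copied into a local variable first, because inside a nested
# JavaScript function `arguments` would refer to that function's own arguments.
FIND_RESOURCES_SCRIPT = """
var fragment = arguments[0];
var entries = window.performance.getEntriesByType('resource');
var urls = [];
for (var i = 0; i < entries.length; i++) {
    if (entries[i].name.indexOf(fragment) !== -1) {
        urls.push(entries[i].name);
    }
}
return urls;
"""


def find_resource_urls(driver, fragment):
    """Return URLs of resources loaded by the page whose URL contains fragment.

    driver is any object with a Selenium-style execute_script method.
    """
    return list(driver.execute_script(FIND_RESOURCES_SCRIPT, fragment))
```

After driver.get(...) on the match page, find_resource_urls(driver, "F24_") should list the matching script URLs, which can then be downloaded with any HTTP client. The format of that file is not stated in the question, so parsing it is left open here.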
