Scraping data from a web XHR feed with Python

Question

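I'm trying to scrape tennis match results from this web page. Specifically, I want the two players' names, the date/time, and the match result. I have two problems:

By default the page doesn't show all matches; the rest appear only after clicking "Show more matches" at the bottom of the page.
When I load the HTML into Beautiful Soup, the data doesn't seem to be there. It looks like it is loaded through some kind of query (' http://d.flashscore.com/x/feed/f_ '), but I'm not sure how to run it directly.

A sample of my code is below: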

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "http://www.scoreboard.com/au/tennis/wta-singles/australian-open-2016/results/"

req = Request(url, headers={"X-Fsign": "SW9D1eZo"})
s = urlopen(req, timeout=50).read()
soup = BeautifulSoup(s, "lxml")

match_times = soup.find_all("td", class_="cell_ad time")
players = soup.find_all("span", class_="padl")
results = soup.find_all("td", class_="cell_sa score  bold")
# these all return empty element sets

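How can I load the page with all results visible? And how can I extract the data above elegantly?

Edit: following the suggestion to use Selenium, I've built a function that loads the page with Selenium / Chrome and then passes the HTML to bs4: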

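import time
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # created once and reused by open_url

def open_url(url):
    try:
        # Request the page with urllib first; this raises if the page doesn't exist.
        req = Request(url)
        s = urlopen(req, timeout=20).read()
        driver.get(url)
        try:
            # Click "show more" once, then pause for a fixed five seconds.
            driver.find_element(By.XPATH, """//*[@id="tournament-page-results-more"]/tbody/tr/td/a""").click()
            time.sleep(5)
        except Exception:
            print("No more results to show...")
        body = driver.find_element(By.ID, "fs-results")
        return BeautifulSoup(body.get_attribute("innerHTML"), "lxml")
    except Exception:
        print("Webpage doesn't exist")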

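This means I can display all the results by clicking the "show more" button. Unfortunately, the code carries on before the page has finished loading, so when I try to get all the rows containing results: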

matches = []
soup = open_url(url)
rrows = soup.find_all("tr")
for rrow in rrows:
    # Skip round-header rows; .get avoids a KeyError on rows without a class attribute.
    if rrow.get("class") != ["event_round"]:
        matches.append(rrow)

it only gets the results that were visible initially. How can I fix this?

Answer 1

This page uses JavaScript to fetch its data, so if you use urllib you will only get the HTML code without the data.
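If the XHR request the page makes is visible in the browser's network tab, one thing to try is sending the same request from Python. The sketch below only shows how such a request could be built with urllib. The question gives the feed URL in truncated form, so `feed_url` has to be the full URL copied from the browser. The `X-Fsign` value is the one used in the question and may well change over time. Whether the server accepts the request, and what format the response has, are not confirmed here. `build_feed_request` and `fetch_feed` are hypothetical helper names.

```python
from urllib.request import Request, urlopen


def build_feed_request(feed_url, fsign):
    """Return a Request for an XHR feed URL that carries the X-Fsign header.

    feed_url: the full feed URL as seen in the browser's network tab (assumption:
              the caller supplies it; the question only shows a truncated form).
    fsign:    the X-Fsign token value, e.g. the one used in the question.
    """
    return Request(feed_url, headers={"X-Fsign": fsign})


def fetch_feed(feed_url, fsign, timeout=20):
    """Download the raw feed body as text; parsing it is left to the caller."""
    req = build_feed_request(feed_url, fsign)
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Printing the first few hundred characters of `fetch_feed(...)` is a quick way to see whether the response contains the match data at all before writing a parser for it.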

Use Selenium to scrape JavaScript-driven pages.
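For the follow-up problem in the question (only the initially visible rows are returned), one option is to replace the fixed `time.sleep(5)` with a loop that clicks "show more" and then waits until the number of rows has actually grown. The sketch below is a hand-rolled polling loop, not Selenium's own `WebDriverWait`. It relies only on `find_elements(by, value)`, `is_displayed()` and `click()`, which Selenium drivers and elements provide. The "show more" XPath is taken from the question. `ROWS_XPATH` and the name `load_all_results` are assumptions for illustration.

```python
import time

# XPath of the "show more" link, copied from the question's code.
MORE_XPATH = '//*[@id="tournament-page-results-more"]/tbody/tr/td/a'
# Assumption: every result row is a <tr> inside the element with id "fs-results".
ROWS_XPATH = '//*[@id="fs-results"]//tr'


def load_all_results(driver, timeout=20.0, poll=0.5):
    """Click "show more" until it is no longer displayed.

    After each click, poll until the row count has increased or `timeout`
    seconds have passed. Returns the number of rows found at the end.
    """
    while True:
        # "xpath" is the string value of selenium's By.XPATH.
        links = driver.find_elements("xpath", MORE_XPATH)
        visible = [link for link in links if link.is_displayed()]
        if not visible:
            break  # nothing left to click
        before = len(driver.find_elements("xpath", ROWS_XPATH))
        visible[0].click()
        deadline = time.monotonic() + timeout
        while len(driver.find_elements("xpath", ROWS_XPATH)) <= before:
            if time.monotonic() > deadline:
                return before  # no new rows arrived in time; give up
            time.sleep(poll)
    return len(driver.find_elements("xpath", ROWS_XPATH))
```

In `open_url` this would be called right after `driver.get(url)`, in place of the single click and sleep, and the `fs-results` element would be handed to BeautifulSoup only after it returns.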

Solution 1
0, accepted, 2017-01-17 02:59:56