Scrape 'li' tags from a data table that changes based on drop-down menu

I am trying to scrape data from a data table on this website: http://www.oddsshark.com/ncaab/lsu-alabama-odds-february-18-2017-744793

The site has multiple tabs, each of which changes the HTML (I am working in the 'Matchup' tab). Within that Matchup tab, there is a drop-down menu that changes the data table I am trying to access. The items in the table that I am trying to access are 'li' tags within an unordered list. I just want to scrape the data from the "Overall" category of the drop-down menu.

I have been unable to access the data that I want. The item that I'm trying to access comes back as 'NoneType'. Is there a way to do this?

import requests
from bs4 import BeautifulSoup

url = "http://www.oddsshark.com/ncaab/lsu-alabama-odds-february-18-2017-744793"
html_page = requests.get(url)
soup = BeautifulSoup(html_page.content, 'html.parser')

dataList = []
for ultag in soup.find_all('ul', {'class': 'base-list team-stats'}):
    print(ultag)
    for iltag in ultag.find_all('li'):
        dataList.append(iltag.get_text())

The problem is that the content of the tab you are trying to pull data from is loaded dynamically using React JS, so it is not present in the HTML that requests receives. You therefore have to use the selenium module in Python to open a browser, click the "Matchup" list element programmatically, and then get the page source after the click.
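A quick way to confirm this diagnosis is to check whether the target list exists in a given piece of markup at all. The sketch below is only illustrative: the helper name has_stats_list and both HTML strings are made up for the example (they are not copied from the site), and it assumes the list carries both the base-list and team-stats classes, as in the question's code.

```python
from bs4 import BeautifulSoup


def has_stats_list(html: str) -> bool:
    """Return True if the markup contains a <ul> with both target classes."""
    soup = BeautifulSoup(html, "html.parser")
    # select_one returns None when nothing matches the CSS selector.
    return soup.select_one("ul.base-list.team-stats") is not None


# Hypothetical markup: what a plain HTTP fetch might return before any
# JavaScript has run, versus what the browser holds after rendering.
static_html = '<div id="app"></div>'
rendered_html = (
    '<div id="app"><ul class="base-list team-stats">'
    "<li>Overall</li></ul></div>"
)

print(has_stats_list(static_html))    # False
print(has_stats_list(rendered_html))  # True
```

Running the same check on the response from requests.get(url) and on the browser's page source after the click should show whether the list only appears once the JavaScript has run.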

On my Mac I installed selenium and the Chrome web driver (chromedriver) using these instructions:

https://gist.github.com/guylaor/3eb9e7ff2ac91b7559625262b8a6dd5f

I then signed the python file, so that the OS X firewall doesn't complain when we try to run it, following the instructions in "Add Python to OS X Firewall Options?"

Then I ran the following Python 3 code:

import os
import time
from selenium import webdriver
from bs4 import BeautifulSoup as soup

# Setup Selenium Chrome Web Driver
chromedriver = "/usr/local/bin/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)

# Navigate in Chrome to specified page.
driver.get("http://www.oddsshark.com/ncaab/lsu-alabama-odds-february-18-2017-744793")

# Find the "Matchup" list element using a CSS selector and click it.
driver.find_element_by_css_selector("li[id='react-tabs-0']").click()

# Wait for content to load
time.sleep(1)

# Get the current page source.
source = driver.page_source

# Parse the page source as it is after the click, using "html.parser" as the parser.
soupify = soup(source, 'html.parser')

dataList = []
for ultag in soupify.find_all('ul', {'class': 'base-list team-stats'}):
    print(ultag)
    for iltag in ultag.find_all('li'):
        dataList.append(iltag.get_text())

# We are done with the driver so quit.
driver.quit()
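The fixed time.sleep(1) above can be too short on a slow connection, and newer Selenium releases (4.x) no longer provide find_element_by_css_selector. As a hedged variation rather than a replacement for the code above, the sketch below uses the Selenium 4 locator API with explicit waits. It assumes chromedriver can be found automatically (on the PATH or via Selenium Manager); the function names fetch_rendered_source and extract_stats are invented for this example, and the selectors are the ones used above, which may no longer match the live site.

```python
from typing import List

from bs4 import BeautifulSoup


def extract_stats(page_source: str) -> List[str]:
    """Collect the text of every <li> inside the team-stats lists."""
    soup = BeautifulSoup(page_source, "html.parser")
    return [
        li.get_text(strip=True)
        for ul in soup.select("ul.base-list.team-stats")
        for li in ul.find_all("li")
    ]


def fetch_rendered_source(url: str,
                          tab_selector: str = "li[id='react-tabs-0']",
                          timeout: int = 10) -> str:
    """Open the page in Chrome, click the tab, and return the rendered HTML."""
    # Imported here so extract_stats can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes chromedriver is discoverable
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout)
        # Wait until the tab is clickable instead of sleeping a fixed time.
        wait.until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, tab_selector))
        ).click()
        # Wait until the stats list has actually been rendered.
        wait.until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "ul.base-list.team-stats")
            )
        )
        return driver.page_source
    finally:
        # Always close the browser, even if a wait times out.
        driver.quit()
```

A call would look like dataList = extract_stats(fetch_rendered_source(url)), with url set to the page from the question. Unlike the code above, this version strips surrounding whitespace from each item via get_text(strip=True).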

Hope this helps. I noticed this is similar to a problem I just solved here: "Beautifulsoup doesn't reach a child element".
