
Data scraping of an aria-label with beautifulsoup

From the following, I am trying to extract the analysts' price targets. I am interested in the information present inside the aria-label.

I tried multiple BeautifulSoup snippets I found online, with the following setup:

import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'XXXXX'}  # XXXXX replaced with an actual user agent
url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
  1. The aria-label seems to belong to a div with a class, so I tried the following:

     target = soup.find('div', {'class':'Px(10px)'})

Result = None

  2. It is inside a section, so I tried the following:

     target = soup.find('section', attrs={'data-test':'price-targets'})

Result = None

  3. Then I tried to go one level higher, using the ID:

     target = soup.find('div', {'id':'mrt-node-Col2-5-QuoteModule'}).find_all('div')[0]

Result = < div data-react-checksum="2049647463" data-reactid="1" data-reactroot="" id="Col2-5-QuoteModule-Proxy">< span data-reactid="2">< /span>< /div>结果 = < div data-react-checksum="2049647463" data-reactid="1" data-reactroot="" id="Col2-5-QuoteModule-Proxy">< span data-reactid="2">< /span>< /div>

Thus, I am getting closer with option 3, but I receive an error when I modify the find_all div index.

Is there any solution or workaround to extract the four values present in the aria-label?

The numbers next to 'Low', 'Current', 'Average' & 'High' are my target.


As selenium might consume time to iterate, I found a second possible solution to my issue, which is to get the source code of the page using requests and search for the data with a combination of json and regex.
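A minimal sketch of that idea, assuming the page still embeds its state in a root.App.main JSON blob; the regex and the QuoteSummaryStore path are assumptions based on how the page has historically been structured and may change:

import json
import re
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
html = requests.get(url, headers=headers).text

# Assumption: the page embeds its data as `root.App.main = {...};`
match = re.search(r'root\.App\.main\s*=\s*(\{.*\});', html)
if match:
    data = json.loads(match.group(1))
    # Assumed path to the analyst figures; may differ on the live page
    fin = (data.get('context', {}).get('dispatcher', {}).get('stores', {})
               .get('QuoteSummaryStore', {}).get('financialData', {}))
    for key in ('targetLowPrice', 'currentPrice', 'targetMeanPrice', 'targetHighPrice'):
        value = fin.get(key)
        # Numbers are typically wrapped as {'raw': ..., 'fmt': ...}
        print(key, value.get('raw') if isinstance(value, dict) else value)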

As @Ann Zen mentioned in the comments, the website renders its elements and data dynamically, and BeautifulSoup can't handle that alone. Using Selenium will wait until the app is loaded and then try to get the element.

Example: web-scraping-with-selenium
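A minimal sketch of that approach with an explicit wait, reusing the [data-test="price-targets"] selector from the question (assumes ChromeDriver is available on the machine):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL')

# Wait up to 10 seconds for the price-targets section to be rendered
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '[data-test="price-targets"] div'))
)
print(element.get_attribute('aria-label'))
driver.quit()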

What about the yfinance Python package?
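If yfinance works for you, a short sketch could look like the following; the info keys (targetLowPrice, targetMeanPrice, targetHighPrice, currentPrice) are an assumption and may change between releases:

import yfinance as yf  # pip install yfinance

ticker = yf.Ticker('AAPL')
info = ticker.info
# Analyst price targets, if exposed by the current yfinance release
for key in ('targetLowPrice', 'currentPrice', 'targetMeanPrice', 'targetHighPrice'):
    print(key, info.get(key))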

To get the analyst price targets, you need to scroll the page to the end and wait until the data is loaded. In this case, the selenium library is used, which allows you to simulate user actions in the browser.

Install libraries:

pip install bs4 lxml selenium webdriver_manager

Import libraries:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time, lxml

For selenium to work, you need ChromeDriver, which can be downloaded manually or from code. In our case, the second method is used. To control the start and stop of ChromeDriver, you need to use a Service, which will install the browser binaries under the hood:

service = Service(ChromeDriverManager().install())

You should also add options for the driver to work correctly:

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--lang=en')
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36')
  • --headless - to run Chrome in headless mode.
  • --lang=en - to set the browser language to English.
  • user-agent - to act as a "real" user request from the browser by passing it in the request headers. Check what your user-agent is.

Now we can start webdriver and pass the URL to the get() method:

URL = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'

driver = webdriver.Chrome(service=service, options=options)
driver.get(URL)

The page scrolling algorithm looks like this:

  1. Find out the initial page height and write the result to the old_height variable.
  2. Scroll the page using the script and wait 2 seconds for the data to load.
  3. Find out the new page height and write the result to the new_height variable.
  4. If new_height and old_height are equal, the algorithm is complete; otherwise, write the value of new_height to old_height and return to step 2.

Getting the page height and scrolling is done by passing JavaScript code into the execute_script() method:

old_height = driver.execute_script("""
    function getHeight() {
        return document.querySelector('#Aside').scrollHeight;
    }
    return getHeight();
""")

while True:
    driver.execute_script('window.scrollTo(0, document.querySelector("#Aside").scrollHeight);')

    time.sleep(2)

    new_height = driver.execute_script("""
        function getHeight() {
            return document.querySelector('#Aside').scrollHeight;
        }
        return getHeight();
    """)

    if new_height == old_height:
        break

    old_height = new_height

We create the soup object and stop the driver:

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

Selecting an element by class is not always a good idea because classes can change. It is more reliable to access attributes. In this case, I'm accessing the data-test attribute with the value price-targets, and then the div inside it. The value of the aria-label attribute is retrieved and printed from the resulting object:

price_targets = soup.select_one('[data-test="price-targets"] div').get('aria-label')
print(price_targets)

If you want to extract other data, you can see the Scrape Yahoo! Finance Home Page with Python blog post, which describes this in detail.

Code and full example in online IDE:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time, lxml

URL = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'

# Download and manage the ChromeDriver binary automatically
service = Service(ChromeDriverManager().install())

# Run headless, in English, with a "real" browser user-agent
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--lang=en')
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36')

driver = webdriver.Chrome(service=service, options=options)
driver.get(URL)

# Initial height of the scrollable container
old_height = driver.execute_script("""
    function getHeight() {
        return document.querySelector('#Aside').scrollHeight;
    }
    return getHeight();
""")

# Scroll until the height stops changing, i.e. all data is loaded
while True:
    driver.execute_script('window.scrollTo(0, document.querySelector("#Aside").scrollHeight);')

    time.sleep(2)

    new_height = driver.execute_script("""
        function getHeight() {
            return document.querySelector('#Aside').scrollHeight;
        }
        return getHeight();
    """)

    if new_height == old_height:
        break

    old_height = new_height

# Parse the fully rendered page and release the browser
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

# Grab the aria-label from the div inside the price-targets section
price_targets = soup.select_one('[data-test="price-targets"] div').get('aria-label')

print(price_targets)

Output:

Low  122 Current  129.62 Average  174.62 High  214
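If you need the four values separately, the aria-label string can be split with a small regex; a sketch based on the output above:

import re

label = 'Low  122 Current  129.62 Average  174.62 High  214'
pairs = re.findall(r'(Low|Current|Average|High)\s+([\d.]+)', label)
targets = {name: float(value) for name, value in pairs}
print(targets)  # {'Low': 122.0, 'Current': 129.62, 'Average': 174.62, 'High': 214.0}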
