From the following page, I am trying to extract the analysts' price targets; I am interested in the information inside the aria-label attribute. I tried multiple versions of BeautifulSoup code I found online, with the following setup:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'XXXXX'}  # XXXXX replaced with my actual user agent
url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
The aria-label seems to sit on a div with a class, so I tried the following:
target = soup.find('div', {'class':'Px(10px)'})
Result = None
It is inside a section, so I tried the following:
target = soup.find('section', attrs={'data-test':'price-targets'})
Result = None
Then I tried to go even higher up, using the ID:
target = soup.find('div', {'id':'mrt-node-Col2-5-QuoteModule'}).find_all('div')[0]
Result = <div data-react-checksum="2049647463" data-reactid="1" data-reactroot="" id="Col2-5-QuoteModule-Proxy"><span data-reactid="2"></span></div>
Thus, I am getting closer with option 3, but I get an error whenever I change the index on the find_all('div') result.
Is there any solution or workaround to extract the four values present in the aria-label? The numbers next to 'Low', 'Current', 'Average' & 'High' are my target.
As selenium might consume time to iterate, I found a second possible solution to my issue: get the source code of the page using requests and search for the data with a combination of json and regex.
As @Ann Zen mentioned in the comments, the website renders elements and data dynamically, and BeautifulSoup alone can't handle that. Using Selenium will wait until the app has loaded and then try to get the element.
Example: web-scraping-with-selenium
What about the yfinance Python package?

To get the analyst price targets, you need to scroll the page to the end and wait until the data has loaded. In this case the selenium library is used, which lets you simulate user actions in the browser.
Install libraries:
pip install bs4 lxml selenium webdriver_manager
Import libraries:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time  # lxml only needs to be installed, not imported; bs4 uses it as a parser
For selenium to work, you need ChromeDriver, which can be downloaded manually or from code; here the second method is used. To control the start and stop of ChromeDriver, you use Service, which will install the browser binaries under the hood:
service = Service(ChromeDriverManager().install())
You should also add options for it to work correctly:
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--lang=en')
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36')
--headless: run Chrome in headless mode.
--lang=en: set the browser language to English.
user-agent: act like a "real" user request from the browser by passing it in the request headers. Check what your user-agent is.
Now we can start the webdriver and pass the URL to the get() method:
URL = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
driver = webdriver.Chrome(service=service, options=options)
driver.get(URL)
The page scrolling algorithm looks like this:
1. Get the page height and store it in the old_height variable.
2. Scroll to the bottom of the page and wait for it to load.
3. Get the page height again and store it in the new_height variable.
4. If new_height and old_height are equal, stop; otherwise write the value of new_height into old_height and return to step 2.
Getting the page height and scrolling is done by passing JavaScript code to the execute_script() method:
old_height = driver.execute_script("""
    function getHeight() {
        return document.querySelector('#Aside').scrollHeight;
    }
    return getHeight();
""")
while True:
    driver.execute_script('window.scrollTo(0, document.querySelector("#Aside").scrollHeight);')
    time.sleep(2)
    new_height = driver.execute_script("""
        function getHeight() {
            return document.querySelector('#Aside').scrollHeight;
        }
        return getHeight();
    """)
    if new_height == old_height:
        break
    old_height = new_height
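To see why this loop always terminates once the page stops growing, here is the same compare-heights logic with the driver calls stubbed out, so it can be run without a browser (the height sequence is made up to simulate a page that grows twice and then stabilizes):

```python
def scroll_until_stable(get_height, scroll_once):
    """Scroll repeatedly until two consecutive height readings match."""
    old_height = get_height()
    while True:
        scroll_once()
        new_height = get_height()
        if new_height == old_height:
            return new_height
        old_height = new_height

# Simulated scrollHeight readings: the page grows twice, then stops changing.
heights = iter([1000, 2500, 4000, 4000])
final_height = scroll_until_stable(lambda: next(heights), lambda: None)
print(final_height)  # 4000
```

In the real script, get_height corresponds to the execute_script height query and scroll_once to the window.scrollTo call plus time.sleep(2).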
We create the soup
object and stop the driver:
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
Selecting an element by class is not always a good idea, because classes can change. It is more reliable to use attributes. In this case I access the data-test attribute with the value price-targets, and then the div inside it. The value of its aria-label attribute is then retrieved and printed:
price_targets = soup.select_one('[data-test="price-targets"] div').get('aria-label')
print(price_targets)
If you want to extract other data, see the Scrape Yahoo! Finance Home Page with Python blog post, which describes this in detail.
Code and full example in an online IDE:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time  # lxml only needs to be installed, not imported; bs4 uses it as a parser
URL = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
service = Service(ChromeDriverManager().install())
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--lang=en')
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36')
driver = webdriver.Chrome(service=service, options=options)
driver.get(URL)
old_height = driver.execute_script("""
    function getHeight() {
        return document.querySelector('#Aside').scrollHeight;
    }
    return getHeight();
""")
while True:
    driver.execute_script('window.scrollTo(0, document.querySelector("#Aside").scrollHeight);')
    time.sleep(2)
    new_height = driver.execute_script("""
        function getHeight() {
            return document.querySelector('#Aside').scrollHeight;
        }
        return getHeight();
    """)
    if new_height == old_height:
        break
    old_height = new_height
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
price_targets = soup.select_one('[data-test="price-targets"] div').get('aria-label')
print(price_targets)
Output:
Low 122 Current 129.62 Average 174.62 High 214
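If you need the four numbers individually rather than the raw string, the output alternates label and value, so it can be split into a dict (a sketch assuming the label order shown above):

```python
# The aria-label alternates label and number, so a plain split() is enough
# to turn the printed string into a dict of floats.
price_targets = 'Low 122 Current 129.62 Average 174.62 High 214'
parts = price_targets.split()
values = {parts[i]: float(parts[i + 1]) for i in range(0, len(parts), 2)}
print(values)  # {'Low': 122.0, 'Current': 129.62, 'Average': 174.62, 'High': 214.0}
```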