
How to fetch this data with Beautiful Soup 4 or lxml?

Here's the website in question:

https://www.gurufocus.com/stock/AAPL

And the part that interests me is this one (it's the GF Score in the upper part of the website):

[screenshot of the GF Score widget]

I need to extract the strings 'GF Score' and '98/100'.

Firefox Inspector gives me span.t-h6 > span:nth-child(1) as a CSS selector, but I can't fetch either the number or the descriptor.
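(The selector syntax itself is fine; against a static, made-up snippet that mimics the widget, Beautiful Soup resolves it as expected, which suggests the problem is that the data isn't in the HTML that requests receives:)

```python
from bs4 import BeautifulSoup

# Made-up static markup mimicking the widget -- not the real page source
snippet = '<span class="t-h6"><span>GF Score</span> <span>98/100</span></span>'
soup = BeautifulSoup(snippet, 'html.parser')

print(soup.select_one('span.t-h6 > span:nth-child(1)').get_text())  # GF Score
print(soup.select_one('span.t-h6 > span:nth-child(2)').get_text())  # 98/100
```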

Here's the code that I've used so far to extract the "GF Score" part:

import requests
from bs4 import BeautifulSoup
from lxml import html

req = requests.get('https://www.gurufocus.com/stock/AAPL')

soup = BeautifulSoup(req.content, 'html.parser')
score_soup = soup.select('#gf-score-section-003550 > span > span:nth-child(1)')
score_soup_2 = soup.select('span.t-h6 > span')
print(score_soup)
print(score_soup_2)

tree = html.fromstring(req.content)
score_lxml = tree.xpath('//*[@id="gf-score-section-003550"]/span/span[1]')
print(score_lxml)

As a result, all three prints give me empty lists.

Both the XPath and the nth-child selector were copied directly out of Chrome via the copy function.

Any suggestions as to what might be at fault here?

Unfortunately, getting the page with the Requests library is impossible, and the API it calls requires a signature. That leaves two options:

Use the API. It's not free, but it is much more convenient and faster. The second option is Selenium. It's free, but slow unless you fine-tune the waits for elements. The second problem is bot protection (Cloudflare): without rotating headers and/or IPs you will probably get banned soon. Here is an example:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup


def get_gf_score(ticker_symbol: str, timeout=10):
    driver.get(f'https://www.gurufocus.com/stock/{ticker_symbol}/summary')
    try:
        element_present = EC.presence_of_element_located((By.ID, 'register-dialog-email-input'))
        WebDriverWait(driver, timeout).until(element_present)
        return BeautifulSoup(driver.page_source, 'lxml').find('span', string='GF Score:').find_next('span').get_text(strip=True)
    except TimeoutException:
        print("Timed out waiting for page to load")


tickers = ['AAPL', 'MSFT', 'AMZN']
driver = webdriver.Chrome()
for ticker in tickers:
    print(ticker, get_gf_score(ticker), sep=': ')

OUTPUT:

AAPL: 98/100
MSFT: 97/100
AMZN: 88/100

One way you could get the desired value is:

  • Make the request to the page, pretending it comes from a browser, and then extract the info you need from the JSON object inside the script HTML tag 1 .

NOTE:

1 Please be warned that I couldn't parse the JSON object - this is the JSON result, btw - and extract the value by following the path:

 js_data['fetch']['data-v-4446338b:0']['stock']['gf_score']
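For illustration, if the payload could be parsed, the lookup along that path would be plain dict indexing. The payload below is a minimal, made-up stand-in that only mirrors the assumed structure of the real one:

```python
import json

# Minimal made-up payload mirroring the assumed structure of the real page data
payload = '{"fetch": {"data-v-4446338b:0": {"stock": {"gf_score": 98}}}}'
js_data = json.loads(payload)

print(js_data['fetch']['data-v-4446338b:0']['stock']['gf_score'])  # 98
```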

So, as an alternative (not a very good one, IMHO, but it works for your purpose), I decided to find certain markers in the JSON/string result and then extract the desired value by slicing out a substring.

Full code:

import requests
from bs4 import BeautifulSoup

geturl = r'https://www.gurufocus.com/stock/AAPL'

getheaders = {
    'Accept': 'text/html; charset=utf-8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    'Referer': 'https://www.gurufocus.com'
}

r = requests.get(geturl, headers=getheaders)
soup = BeautifulSoup(r.text, "html.parser")

# This is the "<script>" element that contains the full JSON object
# with all the data.
scripts = soup.find_all("script")[1]

# Get only the JSON data:
js_data = scripts.get_text("", strip=True)

# -- Get the value from the "gf_score" string - by getting its position:

# Part I: is where the "gf_score" begins. 
part_I = js_data.find("gf_score")

# Part II: is where the final position is declared - in this case AFTER the "gf_score" value.
part_II = js_data.find(",gf_score_med")

# Build the desired result and print it: 
gf_score = js_data[part_I:part_II].replace("gf_score:", "GF Score: ") + "/100"
print(gf_score)

Result:

GF Score: 98/100
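A slightly sturdier variant of the same substring idea, assuming the score still appears as `gf_score:<number>` in the embedded script, is a regular expression. The fragment below is made up to stand in for the script contents fetched above:

```python
import re

# Made-up fragment standing in for the embedded script contents
js_data = 'stock:{...,gf_score:98,gf_score_med:85.4,...}'

# re.search returns the first match, so gf_score_med is not picked up
match = re.search(r'gf_score:(\d+)', js_data)
if match:
    print(f'GF Score: {match.group(1)}/100')  # GF Score: 98/100
```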

The data is dynamic. I think rank is what you are looking for, but the API requires authentication. Maybe you can use Selenium or Playwright to render the page?

