How can I get this data using Beautiful Soup 4 or lxml?

Question

This is the website in question:

https://www.gurufocus.com/stock/AAPL

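The part I'm interested in is this (the GF Score near the top of the page):

I need to extract the strings "GF Score" and "98/100".

Firefox Inspector gives me span.t-h6 > span:nth-child(1) as the CSS selector, but I can't seem to get either the number or the label.

This is the code I've used so far to try to extract the "GF Score" part: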

import requests
from bs4 import BeautifulSoup
from lxml import html

req = requests.get('https://www.gurufocus.com/stock/AAPL')

soup = BeautifulSoup(req.content, 'html.parser')
score_soup = soup.select('#gf-score-section-003550 > span > span:nth-child(1)')
score_soup_2 = soup.select('span.t-h6 > span')
print(score_soup)
print(score_soup_2)

tree = html.fromstring(req.content)
score_lxml = tree.xpath ('//*[@id="gf-score-section-003550"]/span/span[1]')
print(score_lxml)

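As a result, I get three empty lists.

The XPath was taken directly from Chrome using its copy function, and so was the nth-child expression in the BS4 part.

Any suggestions on what might be wrong here?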

Answer 1

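Unfortunately, the page can't be fetched with the Requests library, and the API it uses can't be called directly because it requires a signature. There are two options:

The first is to use the official API. It isn't free, but it's more convenient and faster. The second is Selenium. It's free, but it's slow unless you fine-tune the waits for elements. Another problem is the Cloudflare protection: without changing headers and/or IP, you may get banned fairly quickly. So here's an example: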

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup


def get_gf_score(ticker_symbol: str, timeout=10):
    driver.get(f'https://www.gurufocus.com/stock/{ticker_symbol}/summary')
    try:
        # The element with this ID is used as the "page has loaded" signal.
        element_present = EC.presence_of_element_located((By.ID, 'register-dialog-email-input'))
        WebDriverWait(driver, timeout).until(element_present)
        # Find the "GF Score:" label and read the span that follows it.
        soup = BeautifulSoup(driver.page_source, 'lxml')
        return soup.find('span', string='GF Score:').find_next('span').get_text(strip=True)
    except TimeoutException:
        print("Timed out waiting for page to load")
        return None


tickers = ['AAPL', 'MSFT', 'AMZN']
driver = webdriver.Chrome()
try:
    for ticker in tickers:
        print(ticker, get_gf_score(ticker), sep=': ')
finally:
    driver.quit()  # close the browser even if a lookup fails

Output:

AAPL: 98/100
MSFT: 97/100
AMZN: 88/100
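
The parsing step can be tried without a browser. Here is a minimal sketch that runs the same label-then-next-span lookup on a static snippet. The markup in SAMPLE_HTML and the helper name extract_gf_score are made up for illustration; the real page's structure may differ.

```python
from typing import Optional

from bs4 import BeautifulSoup

# Hypothetical markup, modeled on the label/value pair the code above looks for.
SAMPLE_HTML = """
<div>
  <span class="t-h6">
    <span>GF Score:</span>
    <span> 98/100 </span>
  </span>
</div>
"""


def extract_gf_score(page_source: str) -> Optional[str]:
    """Return the text of the span following the 'GF Score:' label, or None."""
    soup = BeautifulSoup(page_source, "html.parser")
    # Match the span whose only content is exactly this string.
    label = soup.find("span", string="GF Score:")
    if label is None:
        return None
    # The value is assumed to sit in the next <span> in document order.
    value = label.find_next("span")
    return value.get_text(strip=True) if value is not None else None


print(extract_gf_score(SAMPLE_HTML))
print(extract_gf_score("<html><body></body></html>"))
```

Returning None when the label is missing avoids an AttributeError if the page layout changes or the content has not rendered yet.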

Answer 2

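One way to get the value you need is:

Make a request to the page, pretending that the request comes from a browser, and then extract the information you need from the JSON object inside a script HTML tag¹.

Note:

¹ Be warned that I wasn't able to load that script content as a JSON object and extract the value by following this path: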
 js_data['fetch']['data-v-4446338b:0']['stock']['gf_score']

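So, as an alternative (IMHO not a great one, but it serves your purpose), I decided to locate certain markers in the JSON/string result and then extract the desired value by slicing the string (i.e. taking a substring).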

Full code:

import requests
from bs4 import BeautifulSoup
import json

geturl = r'https://www.gurufocus.com/stock/AAPL'

getheaders = {
    'Accept': 'text/html; charset=utf-8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    'Referer': 'https://www.gurufocus.com'
}

s = requests.Session()
r = s.get(geturl, headers=getheaders)
soup = BeautifulSoup(r.text, "html.parser")

# This is the "<script>" element tag that contains the full JSON object 
# with all the data.
scripts = soup.find_all("script")[1]

# Get only the JSON data: 
js_data = scripts.get_text("", strip=True)

# -- Get the value from the "gf_score" string - by getting its position:

# Part I: is where the "gf_score" begins. 
part_I = js_data.find("gf_score")

# Part II: is where the final position is declared - in this case AFTER the "gf_score" value.
part_II = js_data.find(",gf_score_med")

# Build the desired result and print it: 
gf_score = js_data[part_I:part_II].replace("gf_score:", "GF Score: ") + "/100"
print(gf_score)

Result:

GF Score: 98/100
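
The slicing above depends on "gf_score_med" coming right after "gf_score" in the text. A slightly less position-dependent variant is a regular expression, sketched below. SAMPLE_SCRIPT is a made-up fragment shaped like the text the slicing code assumes (the values other than 98 are invented), and find_gf_score is a hypothetical helper name.

```python
import re
from typing import Optional

# Hypothetical fragment; the real script content is much longer and may differ.
SAMPLE_SCRIPT = 'stock:{symbol:"AAPL",gf_score:98,gf_score_med:75,rank:1}'

# Matches gf_score:98 as well as "gf_score": 98, but not gf_score_med:75.
GF_SCORE_RE = re.compile(r'\bgf_score"?\s*:\s*"?(\d+)')


def find_gf_score(script_text: str) -> Optional[int]:
    """Return the first integer gf_score value found in the text, or None."""
    match = GF_SCORE_RE.search(script_text)
    return int(match.group(1)) if match else None


score = find_gf_score(SAMPLE_SCRIPT)
if score is not None:
    print(f"GF Score: {score}/100")
```

This is still text scraping rather than real parsing, so it carries the same caveat as the substring approach: it breaks if the site renames or restructures the field.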

Answer 3

The data is dynamic. I think rank is what you're looking for, but the API requires authentication. Maybe you could use Selenium or Playwright to render the page?
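
To illustrate what "dynamic" means here, the sketch below runs the question's selector against two made-up documents: one standing in for the raw response before JavaScript runs, the other for the DOM after rendering. Both HTML strings and the helper name select_texts are assumptions for illustration, not copies of the real page.

```python
from typing import List

from bs4 import BeautifulSoup

SELECTOR = "span.t-h6 > span"  # selector from the question

# Hypothetical stand-in for the raw response: an empty app shell plus a script.
STATIC_SHELL = (
    '<html><body><div id="app"></div>'
    "<script>/* data is loaded by JavaScript */</script></body></html>"
)

# Hypothetical stand-in for the DOM after the browser has rendered the page.
RENDERED_DOM = (
    '<html><body><div id="app"><span class="t-h6">'
    "<span>GF Score:</span><span>98/100</span>"
    "</span></div></body></html>"
)


def select_texts(html: str, selector: str = SELECTOR) -> List[str]:
    """Return the stripped text of every element matching the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(selector)]


print(select_texts(STATIC_SHELL))
print(select_texts(RENDERED_DOM))
```

If the selector matches in the browser's inspector but returns an empty list on the downloaded HTML, the elements are most likely being created by JavaScript, which is why a rendering tool is suggested.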

Answer 1
0 2023-01-10 15:56:24

Answer 2
0 2023-01-12 14:07:21

Answer 3
-1 2023-01-10 07:25:33
