How to fetch some data conditionally with Python and Beautiful Soup

How to fetch this data with Beautiful Soup 4 or lxml?
Here is the site in question:
https://www.gurufocus.com/stock/AAPL
The part I am interested in is this (it is the GF Score near the top of the page):
I need to extract the strings "GF Score" and "98/100".
Firefox's Inspector gives me `span.t-h6 > span:nth-child(1)` as a CSS selector, but I cannot seem to fetch either the number or the descriptor.
This is the code I have used so far to extract the "GF Score" part:
```python
import requests
from bs4 import BeautifulSoup
from lxml import html

req = requests.get('https://www.gurufocus.com/stock/AAPL')

# Beautiful Soup attempts
soup = BeautifulSoup(req.content, 'html.parser')
score_soup = soup.select('#gf-score-section-003550 > span > span:nth-child(1)')
score_soup_2 = soup.select('span.t-h6 > span')
print(score_soup)
print(score_soup_2)

# lxml attempt
tree = html.fromstring(req.content)
score_lxml = tree.xpath('//*[@id="gf-score-section-003550"]/span/span[1]')
print(score_lxml)
```
As a result, I get three empty lists.
The XPath was taken directly from Chrome by copying it, as was the nth-child expression in the BS4 part.
Any suggestions as to what might be going wrong here?
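For context, the empty results are expected with this approach: the score block is rendered client-side by JavaScript, so the HTML that `requests` receives does not contain it yet. A minimal sketch (with hypothetical server-rendered markup, not the real GuruFocus response) of why `select()` comes back empty:

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML, as a JS-rendered page might return it: the
# "#gf-score-section-..." element is only created later, in the browser.
static_html = '<html><body><div id="app"></div></body></html>'
soup = BeautifulSoup(static_html, 'html.parser')

# The selector matches nothing in the static markup, so we get [].
print(soup.select('#gf-score-section-003550 > span > span:nth-child(1)'))
```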
Unfortunately, the page cannot be fetched with the Requests lib, and the API, which requires signing up, is not accessible. There are two options:

Use the API. It is not free, but it is more convenient and faster. The second is Selenium. It is free, but slow unless the waits for elements are fine-tuned. The second problem is the protection: Cloudflare. In short, without changing headers and/or IP you may get banned quickly. So here is an example:
```python
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup


def get_gf_score(ticker_symbol: str, timeout=10):
    driver.get(f'https://www.gurufocus.com/stock/{ticker_symbol}/summary')
    try:
        element_present = EC.presence_of_element_located((By.ID, 'register-dialog-email-input'))
        WebDriverWait(driver, timeout).until(element_present)
        return BeautifulSoup(driver.page_source, 'lxml').find('span', text='GF Score:').find_next('span').get_text(strip=True)
    except TimeoutException:
        print("Timed out waiting for page to load")


tickers = ['AAPL', 'MSFT', 'AMZN']
driver = webdriver.Chrome()
for ticker in tickers:
    print(ticker, get_gf_score(ticker), sep=': ')
```
OUTPUT:

```
AAPL: 98/100
MSFT: 97/100
AMZN: 88/100
```
One way to get the desired values is to extract the information you need from the JSON object inside the HTML `<script>` tag¹.

Note:

¹ Be warned that there I was unable to get the JSON object (this is the JSON result, by the way) parsed and to extract the value by following this path: `js_data['fetch']['data-v-4446338b:0']['stock']['gf_score']`

So, as an alternative (IMHO not a great option, but it serves your purpose), I decided to find certain elements in the JSON/string result and then extract the desired value (by slicing the string, i.e. a substring).
Full code:
```python
import requests
from bs4 import BeautifulSoup

geturl = 'https://www.gurufocus.com/stock/AAPL'
getheaders = {
    'Accept': 'text/html; charset=utf-8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    'Referer': 'https://www.gurufocus.com',
}

s = requests.Session()
r = s.get(geturl, headers=getheaders)
soup = BeautifulSoup(r.text, "html.parser")

# This is the "<script>" element that contains the full JSON object
# with all the data.
scripts = soup.find_all("script")[1]

# Get only the JSON data:
js_data = scripts.get_text("", strip=True)

# -- Get the value of "gf_score" by locating its position:
# Part I: where "gf_score" begins.
part_I = js_data.find("gf_score")
# Part II: the final position - in this case right AFTER the "gf_score" value.
part_II = js_data.find(",gf_score_med")

# Build the desired result and print it:
gf_score = js_data[part_I:part_II].replace("gf_score:", "GF Score: ") + "/100"
print(gf_score)
```
Result:

```
GF Score: 98/100
```
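As a side note, the slicing above breaks if the token order in the script text ever changes. A slightly sturdier variant (my sketch, assuming the text contains a `gf_score:<digits>` token as shown above) pulls the number out with a regular expression:

```python
import re


def extract_gf_score(js_data: str) -> str:
    """Find a "gf_score:<digits>" token in the raw script text.

    Assumes the embedded data contains the score as e.g. "gf_score:98",
    as in the GuruFocus page above; raises if the token is absent.
    """
    match = re.search(r'gf_score:(\d+)', js_data)
    if match is None:
        raise ValueError('gf_score not found in script data')
    return f'GF Score: {match.group(1)}/100'


# Example on a snippet shaped like the real script text:
sample = 'rank_balancesheet:9,gf_score:98,gf_score_med:72'
print(extract_gf_score(sample))  # GF Score: 98/100
```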