Data scraping of an aria-label with BeautifulSoup
From the following, I am trying to extract the analysts' price targets. I am interested in the information present inside the aria-label.
I tried multiple versions of BeautifulSoup I found online with the following setup:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'XXXXX'}  # XXXXX replaced with an actual user agent
url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
The aria-label seems to sit on a 'div' with a particular 'class', so I tried the following:
target = soup.find('div', {'class':'Px(10px)'})
Result = None
It is inside a section, so I tried the following:
target = soup.find('section', attrs={'data-test':'price-targets'})
Result = None
Then I tried going even higher up, using the ID:
target = soup.find('div', {'id':'mrt-node-Col2-5-QuoteModule'}).find_all('div')[0]
Result = <div data-react-checksum="2049647463" data-reactid="1" data-reactroot="" id="Col2-5-QuoteModule-Proxy"><span data-reactid="2"></span></div>
Thus, I am getting closer with option 3, but I receive an error when I modify the find_all div index.
Is there any solution or workaround to extract the 4 data points present in the aria-label?
The numbers next to 'Low', 'Current', 'Average' & 'High' are my target.
As selenium might consume time to iterate, I found a second possible solution to my issue, which is to get the source code of the page using requests and search for the data with a combination of json & regex.
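That requests + json + regex idea can be sketched as follows. At the time, Yahoo Finance pages embedded their state in a `root.App.main = {...};` script blob; the snippet below demonstrates the extraction pattern on a small inline sample rather than the live page, and the `context`/`financialData` key names are assumptions about the page structure, not a guaranteed schema:

```python
import json
import re

# Illustrative stand-in for r.text from requests; the real page embeds a much
# larger blob. The "root.App.main = {...};" pattern and the key names below
# are assumptions about how the page structured its data.
html = """
<script>
root.App.main = {"context": {"financialData":
    {"targetLowPrice": 122, "targetMeanPrice": 174.62,
     "currentPrice": 129.62, "targetHighPrice": 214}}};
</script>
"""

# Pull the JSON blob out of the script tag, then parse it.
match = re.search(r'root\.App\.main\s*=\s*(\{.*?\});', html, re.DOTALL)
data = json.loads(match.group(1))

targets = data["context"]["financialData"]
print(targets["targetLowPrice"], targets["targetHighPrice"])
```

On the real page you would pass `r.text` instead of the sample string; if the site changes how it embeds its data, the regex and key paths would need adjusting.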
As @Ann Zen mentioned in the comments, the website renders elements and data dynamically, and BeautifulSoup can't handle that alone; Selenium will wait until the app is loaded and then try to get the element.
Example: web-scraping-with-selenium
What about the yfinance Python package?
To get analyst price targets, you need to scroll the page to the end and wait until the data is loaded.
In this case, the selenium library is used, which allows you to simulate user actions in the browser.
Install libraries:
pip install bs4 lxml selenium webdriver_manager
Import libraries:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time, lxml
For selenium to work, you need to use ChromeDriver, which can be downloaded manually or with code. In our case, the second method is used. To control the start and stop of ChromeDriver, you need to use Service, which will install the browser binaries under the hood:
service = Service(ChromeDriverManager().install())
You should also add options for it to work correctly:
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--lang=en')
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36')
--headless - to run Chrome in headless mode.
--lang=en - to set the browser language to English.
user-agent - to act as a "real" user request from the browser by passing it to the request headers. Check what your user-agent is.
Now we can start the webdriver and pass the URL to the get() method:
URL = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
driver = webdriver.Chrome(service=service, options=options)
driver.get(URL)
The page scrolling algorithm looks like this:
1. Get the current page height and save it in the old_height variable.
2. Scroll the page, then get the new page height and save it in the new_height variable.
3. If new_height and old_height are equal, the algorithm is complete; otherwise, write the value of new_height to old_height and return to step 2.
Getting the page height and scrolling is done by passing JavaScript code to the execute_script() method:
old_height = driver.execute_script("""
function getHeight() {
return document.querySelector('#Aside').scrollHeight;
}
return getHeight();
""")
while True:
    driver.execute_script('window.scrollTo(0, document.querySelector("#Aside").scrollHeight);')
    time.sleep(2)

    new_height = driver.execute_script("""
    function getHeight() {
        return document.querySelector('#Aside').scrollHeight;
    }
    return getHeight();
    """)

    if new_height == old_height:
        break

    old_height = new_height
We create the soup object and stop the driver:
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
Selecting an element by class is not always a good idea because classes can change. It is more reliable to access attributes. In this case, I'm accessing the data-test attribute with the value price-targets, and then the div inside it. The value of the aria-label attribute is retrieved and printed from the resulting object:
price_targets = soup.select_one('[data-test="price-targets"] div').get('aria-label')
print(price_targets)
If you want to extract other data, see the Scrape Yahoo! Finance Home Page with Python blog post, which describes this in detail.
Code and full example in the online IDE:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time, lxml
URL = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
service = Service(ChromeDriverManager().install())
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--lang=en')
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36')
driver = webdriver.Chrome(service=service, options=options)
driver.get(URL)
old_height = driver.execute_script("""
function getHeight() {
return document.querySelector('#Aside').scrollHeight;
}
return getHeight();
""")
while True:
    driver.execute_script('window.scrollTo(0, document.querySelector("#Aside").scrollHeight);')
    time.sleep(2)

    new_height = driver.execute_script("""
    function getHeight() {
        return document.querySelector('#Aside').scrollHeight;
    }
    return getHeight();
    """)

    if new_height == old_height:
        break

    old_height = new_height
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
price_targets = soup.select_one('[data-test="price-targets"] div').get('aria-label')
print(price_targets)
Output:
Low 122 Current 129.62 Average 174.62 High 214
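If you need the four values as numbers rather than a single string, the aria-label text can be split into a dict with a small regex. A sketch, assuming the label keeps the "Label value" pattern shown in the output above:

```python
import re

# The aria-label string produced by the scraper above.
price_targets = 'Low 122 Current 129.62 Average 174.62 High 214'

# Pair each word label with the number that follows it.
pairs = re.findall(r'([A-Za-z]+)\s+([\d.]+)', price_targets)
targets = {label: float(value) for label, value in pairs}
print(targets)
# {'Low': 122.0, 'Current': 129.62, 'Average': 174.62, 'High': 214.0}
```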