
Loop through URLs using BeautifulSoup, and either RegEx or Lambda, to do matching?

I am trying to loop through a few URLs and scrape out one specific class. I believe it's called:

<div class="Fw(b) Fl(end)--m Fz(s) C($primaryColor" data-reactid="192">Overvalued</div>

Here is the URL:

https://finance.yahoo.com/quote/goog

Here is the data that I want for GOOG.

Near Fair Value

I believe this will require some kind of Lambda function or RegEx. I tried to do this without using these methodologies, but I couldn't get it working. Here is the code that I am testing.

import requests
from bs4 import BeautifulSoup
import re

mylink = "https://finance.yahoo.com/quote/"
mylist = ['SBUX', 'MSFT', 'GOOG']
mystocks = []

html = requests.get(mylink).text
soup = BeautifulSoup(html, "lxml")

#details = soup.findAll("div", {"class" : lambda L: L and L.startswith('Fw(b) Fl(end)--m')})

details = soup.findAll('div', {'class' : re.compile('Fw(b)*')})
for item in mylist:
    for r in details:
        mystocks.append(item + ' - ' + details)

print(mystocks)


After the code runs, I would like to see something like this.

GOOG - Near Fair Value
SBUX - Near Fair Value
MSFT - Overvalued

The problem is that if I use something like 'Fw(b)*', I get too much data pulled back. If I try to expand that to 'Fw(b) Fl(end)--m Fz(s)', I get nothing back. How can I get the results I showed above?
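A likely reason for both symptoms is that parentheses are grouping syntax in a regular expression, not literal characters. The sketch below uses only the standard `re` module on the class string from the question; the second sample class string (`'Fw(500) D(ib)'`) is a made-up example, not taken from the page.

```python
import re

# Class attribute copied from the question.
class_attr = 'Fw(b) Fl(end)--m Fz(s) C($primaryColor'

# 'Fw(b)*' means "Fw followed by zero or more b", so it matches
# anything that contains "Fw" at all.
loose = re.compile('Fw(b)*')
print(bool(loose.search('Fw(500) D(ib)')))   # True (hypothetical other class)

# Unescaped, this pattern looks for the literal text
# "Fwb Flend--m Fzs", which never occurs in the attribute.
strict = re.compile('Fw(b) Fl(end)--m Fz(s)')
print(strict.search(class_attr))             # None

# re.escape turns the parentheses into literal characters.
escaped = re.compile(re.escape('Fw(b) Fl(end)--m Fz(s)'))
print(bool(escaped.search(class_attr)))      # True
```

Even with a correctly escaped pattern, the match can only succeed if the div is present in the HTML that the script actually receives.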

No need to use regex; a CSS selector is enough. The key is to send the correct HTTP header: User-Agent.

For example:

import requests
from bs4 import BeautifulSoup

urls = [('GOOG', 'https://finance.yahoo.com/quote/goog'),
        ('SBUX', 'https://finance.yahoo.com/quote/sbux'),
        ('MSFT', 'https://finance.yahoo.com/quote/msft')]

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}

for q, url in urls:
    soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
    value = soup.select_one('div:contains("XX.XX") + div').text

    print('{:<10}{}'.format(q, value))

Prints:

GOOG      Near Fair Value
SBUX      Near Fair Value
MSFT      Overvalued

The issue is that the HTML returned to your requests GET request differs from the HTML a browser receives (e.g. view-source:https://finance.yahoo.com/quote/goog). The div that has your target class is missing when you fetch the page from Python. I found that out by printing the HTML from requests and comparing it with the browser's HTML.
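One way to make that comparison repeatable is a small check for the target class in each copy of the page. This is a minimal sketch: `has_target` is a hypothetical helper, and the two HTML strings are short stand-ins for the real browser and script responses.

```python
# Class attribute copied from the question.
TARGET = 'Fw(b) Fl(end)--m Fz(s) C($primaryColor'

def has_target(html, class_value=TARGET):
    """Return True if the text class="<class_value>" appears in html."""
    return 'class="{}"'.format(class_value) in html

# Hypothetical stand-ins for the two responses being compared.
browser_html = ('<div class="Fw(b) Fl(end)--m Fz(s) C($primaryColor" '
                'data-reactid="192">Overvalued</div>')
script_html = '<div id="app"></div>'

print(has_target(browser_html))  # True
print(has_target(script_html))   # False
```

In practice `script_html` would be the `.text` of the response from `requests.get`, and `browser_html` the saved page source from the browser.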

Here are the suggested steps to take:

  1. Append the link ending to the base URL for each symbol by looping through mylist.

  2. The Yahoo server detects that you are a robot by reading your request's headers, and limits some information. You need to add relevant headers to disguise your request.

  3. I suspect that the Yahoo server is only reading your user-agent, but I'll leave that for you to experiment with; it is still useful to post the full headers here for reference.
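The first step, plus the starting point for the user-agent experiment, might look like the sketch below. `build_jobs` and `minimal_headers` are hypothetical names; the user-agent string is the one from the full header set in this answer. Nothing is sent over the network here.

```python
mylink = "https://finance.yahoo.com/quote/"
mylist = ['SBUX', 'MSFT', 'GOOG']

# Smallest header set to try first; add more headers only if the
# server still withholds the target div.
minimal_headers = {
    'user-agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/83.0.4103.97 Safari/537.36'),
}

def build_jobs(base, tickers, headers):
    """Return one (ticker, url, headers) tuple per ticker."""
    return [(t, base + t, dict(headers)) for t in tickers]

jobs = build_jobs(mylink, mylist, minimal_headers)
for ticker, url, hdrs in jobs:
    print(ticker, url)
    # With the requests library, each job would be fetched as:
    #   html = requests.get(url, headers=hdrs).text
```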

The headers can be copied from the Chrome dev tools, in the Network tab. The Trillworks online tool shows how to do so and helps you convert them to requests code.
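If you would rather do the conversion by hand, header lines copied from the Network tab can be turned into a dict with a few lines of code. This is only a sketch: `parse_raw_headers` is a hypothetical helper, the sample text is illustrative, and it assumes one "name: value" pair per line.

```python
def parse_raw_headers(raw):
    """Convert 'name: value' lines into a dict usable as headers=..."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ":authority".
        if not line or line.startswith(':'):
            continue
        # Split on the first colon only; values may contain colons.
        name, sep, value = line.partition(':')
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers

# Illustrative sample of text copied from the browser.
raw = """
:authority: finance.yahoo.com
user-agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0
accept-language: en,en-US;q=0.9
dnt: 1
"""

headers = parse_raw_headers(raw)
print(headers)
```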

Proposed solution:

import requests
from bs4 import BeautifulSoup

mylink = "https://finance.yahoo.com/quote/"
mylist = ['SBUX', 'MSFT', 'GOOG']
mystocks = []

headers = {
    'authority': 'finance.yahoo.com',
    'cache-control': 'max-age=0',
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'en,en-US;q=0.9,fr-FR;q=0.8,fr;q=0.7,ar-EG;q=0.6,ar;q=0.5,my-ZG;q=0.4,my;q=0.3',

    'cookie': '', # Note: I removed the cookie value, it was too long
}

for item in mylist:
    html = requests.get(mylink + item, headers=headers).text
    soup = BeautifulSoup(html, "lxml")
    details = soup.find('div', class_="Fw(b) Fl(end)--m Fz(s) C($primaryColor")
    mystocks.append(item + ' - ' + details.text)

print(mystocks)

This prints:

GOOG - Near Fair Value
SBUX - Near Fair Value
MSFT - Overvalued
