Python web scraping: values inside the same class

Question

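I am a beginner with Python and web scraping.

Consider this web page: https://finance.yahoo.com/quote/AWR?p=AWR

I want to scrape the Forward Dividend & Yield value, but my code behaves strangely: it prints "Ex-Dividend Date" instead of the value.

Here is my code: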

import pandas as pd
import datetime
import requests
import yfinance as yf
import time
from requests.exceptions import ConnectionError
from bs4 import BeautifulSoup



def web_content_div(web_content,class_path):
    web_content_div = web_content.find_all('div',{'class': class_path})
    try:
        spans = web_content_div[0].find_all('span')
        texts = [span.get_text() for span in spans]
    except IndexError:
        texts = []
    
    return texts

def real_time_price(stock_code):
    
    url = 'https://finance.yahoo.com/quote/' + stock_code + '?p=' + stock_code 
   
    try :
        r = requests.get(url)
        web_content = BeautifulSoup(r.text,'lxml')
        texts = web_content_div(web_content, 'My(6px) Pos(r) smartphone_Mt(6px)')
        if texts != []:
            price, change = texts[0],texts[1]
        else:
            price , change = [] , []
    
    ############################ this part does not work ############################
        texts = web_content_div(web_content,'D(ib) W(1/2) Bxz(bb) Pstart(12px) Va(t) ie-7_D(i) ie-7_Pos(a) smartphone_D(b) smartphone_W(100%) smartphone_Pstart(0px) smartphone_BdB smartphone_Bdc($seperatorColor)')
        if texts != []:
            for count, div in enumerate(texts):
                if div == 'Forward Dividend & Yield':
                   dividend = texts[count + 1]
        else:
            dividend = []
    ############################ this part does not work ############################
        
        texts = web_content_div(web_content,'D(ib) W(1/2) Bxz(bb) Pstart(12px) Va(t) ie-7_D(i) ie-7_Pos(a) smartphone_D(b) smartphone_W(100%) smartphone_Pstart(0px) smartphone_BdB smartphone_Bdc($seperatorColor)')
        if texts != []:
            for count, EX in enumerate(texts):
                if EX == 'Ex-Dividend Date':
                    EXdate = texts[count + 1]
        else:
            EXdate = []


    
        texts = web_content_div(web_content,'D(ib) W(1/2) Bxz(bb) Pend(12px) Va(t) ie-7_D(i) smartphone_D(b) smartphone_W(100%) smartphone_Pend(0px) smartphone_BdY smartphone_Bdc($seperatorColor)')
        if texts != []:
            for count, vol in enumerate(texts):
                if vol == 'Volume':
                    volume = texts[count + 1]
        else:
            volume = []

   
        texts = web_content_div(web_content, 'D(ib) W(1/2) Bxz(bb) Pstart(12px) Va(t) ie-7_D(i) ie-7_Pos(a) smartphone_D(b) smartphone_W(100%) smartphone_Pstart(0px) smartphone_BdB smartphone_Bdc($seperatorColor)')
        if texts != []:
            for count, target in enumerate(texts):
                if target == '1y Target Est':
                    one_year_target = texts[count +1]

        else:
            one_year_target = []


    except ConnectionError:
        price, change,dividend, EXdate,volume,one_year_target = [],[],[],[],[],[]

    return price, change,dividend, EXdate,volume,one_year_target


stock=['awr','aapl']


while(True):
    info = []
    col = []
    time_stamp = datetime.datetime.now() - datetime.timedelta(hours=13)
    time_stamp = time_stamp.strftime('%Y-%M-%D %H:%M:%S')
    for stock_code in stock:
        price, change,dividend, EXdate,volume,one_year_target = real_time_price(stock_code)
        info.append(price)
        info.extend([change])
        info.extend([dividend])
        info.extend([EXdate])
        info.extend([volume])
        info.extend([one_year_target])
        time.sleep(5)

    col = [time_stamp]
    col.extend(info)
    print(col)

It prints:

'2021-41-03/14/21 19:41:16', '72.16', '+0.38 (+0.53%)', 'Ex-Dividend Date', 'Feb 12, 2021', '288,352', '77.00',

Answer 1

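This happens because, for some reason, the Yahoo page does not wrap the value you are trying to read in a span.

For example, this is what I see in the page you linked: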

<tr class="Bxz(bb) Bdbw(1px) Bdbs(s) Bdc($seperatorColor) H(36px) " data-reactid="108">
  <td class="C($primaryColor) W(51%)" data-reactid="109">
    <span data-reactid="110">Forward Dividend &amp; Yield</span>
  </td>
  <td class="Ta(end) Fw(600) Lh(14px)" data-test="DIVIDEND_AND_YIELD-value" data-reactid="111">1.34 (1.86%)</td>
</tr>

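So instead of spans = web_content_div[0].find_all('span'), you need to match the appropriate td yourself and take its text rather than a span's. With spans only, the text that follows the "Forward Dividend & Yield" label is the next label, "Ex-Dividend Date", which is exactly what your code prints.

A quick test shows that simply using the following works for this field but breaks some of the others: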

spans = web_content_div[0].find_all('td')

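So this is clearly not a complete solution, but it shows that this is indeed the problem. You need to come up with a selection criterion that matches all of the values you are interested in.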
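One possible criterion, sketched here rather than tested against the live page, is to walk the table rows and pair each label td with its value td, so it no longer matters whether the value is wrapped in a span. The first row of SAMPLE_HTML is a trimmed copy of the markup shown above; the second row, the table wrapper, and the helper name rows_to_dict are assumptions made for illustration. The built-in 'html.parser' backend is used only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# First row: trimmed copy of the markup quoted above.
# Second row: hypothetical, with the value wrapped in a span.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="C($primaryColor) W(51%)"><span>Forward Dividend &amp; Yield</span></td>
    <td class="Ta(end) Fw(600) Lh(14px)" data-test="DIVIDEND_AND_YIELD-value">1.34 (1.86%)</td>
  </tr>
  <tr>
    <td class="C($primaryColor) W(51%)"><span>Ex-Dividend Date</span></td>
    <td class="Ta(end) Fw(600) Lh(14px)"><span>Feb 12, 2021</span></td>
  </tr>
</table>
"""


def rows_to_dict(soup):
    """Map each row's label cell to its value cell, with or without a span."""
    values = {}
    for row in soup.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) == 2:
            label = cells[0].get_text(strip=True)
            values[label] = cells[1].get_text(strip=True)
    return values


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
summary = rows_to_dict(soup)
print(summary.get('Forward Dividend & Yield'))
print(summary.get('Ex-Dividend Date'))

# Alternative: target one cell directly through the data-test attribute
# that appears in the markup above.
cell = soup.find('td', attrs={'data-test': 'DIVIDEND_AND_YIELD-value'})
print(cell.get_text(strip=True))
```

Looking values up by label with summary.get(...) also returns None for a missing label instead of leaving a variable unassigned.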

Also note that you call web_content_div repeatedly with the same class string; you could retrieve the result once and reuse it.
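As a rough sketch of that reuse, a single list of texts can be scanned once for every label you care about. The helper name values_after_labels and the sample list are hypothetical; in your script the list would come from one call to web_content_div for that column.

```python
def values_after_labels(texts, labels):
    """Return {label: text that follows it} for each wanted label, or None if absent."""
    found = {label: None for label in labels}
    # Stop one short of the end so texts[index + 1] is always valid.
    for index, text in enumerate(texts[:-1]):
        if text in found and found[text] is None:
            found[text] = texts[index + 1]
    return found


# Hypothetical texts standing in for a single web_content_div(...) result.
right_column = ['Ex-Dividend Date', 'Feb 12, 2021', '1y Target Est', '77.00']

result = values_after_labels(right_column, ['Ex-Dividend Date', '1y Target Est', 'Beta'])
print(result)
```

Because every label starts out as None, a label that is missing from the page no longer leaves a variable such as EXdate or one_year_target undefined.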
