Text is missing when I try to scrape from a website using BeautifulSoup

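I am trying to scrape the body text of a news article on the London Stock Exchange website, but it does not show up when I try to extract it with BeautifulSoup. Does anyone know how I can extract this information?

I can find the tags when I click Inspect, but the text does not appear when I view the page source (Ctrl + U). I think the information may be loaded onto this page from another site, but I am not sure about that and do not know how to scrape it.

The page I am looking at is: https://www.londonstockexchange.com/news-article/PFG/interim-results-for-six-months-ended-30-june-2020/14665452

I am trying to extract the main content about the interim results.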

The article is stored inside a <script> tag within the page. You can use this example to extract it:

import json
import requests
from bs4 import BeautifulSoup


url = 'https://www.londonstockexchange.com/news-article/PFG/interim-results-for-six-months-ended-30-june-2020/14665452'

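soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# The element with id "ng-lseg-state" holds the page data as JSON with custom
# escapes (&q; for ", &l; for <, and so on). Undo them, replacing &a; last so
# that an escaped ampersand is not re-read as the start of another escape.
data = soup.select_one('#ng-lseg-state').string
for escaped, char in (('&q;', '"'), ('&l;', '<'), ('&g;', '>'), ('&s;', "'"), ('&a;', '&')):
    data = data.replace(escaped, char)
data = json.loads(data)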

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

def find_news_article(data):
    if isinstance(data, dict):
        for k, v in data.items():
            if k == 'newsArticle':
                yield v
            else:
                yield from find_news_article(v)
    elif isinstance(data, list):
        for v in data:
            yield from find_news_article(v)

article = BeautifulSoup(next(find_news_article(data))['value'], 'html.parser')

# print text from article on screen:
print(article.get_text(strip=True, separator='\n'))

Prints:

RNS Number : 1348X
Provident Financial PLC
26 August 2020
Provident Financial plc
Interim results for the six months ended 30 June 2020
Provident Financial plc ('the Group') is the leading provider of credit products to consumers who are underserved by mainstream lenders. The Group serves c.2.2 million customers and its operations consist of Vanquis Bank, Moneybarn, and the Consumer Credit Division ('CCD') comprising Provident home credit and Satsuma.

...and so on.
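
If you want to try the two steps (undoing the escapes, then searching the nested JSON for a key) without hitting the site, here is a minimal offline sketch using only the standard library. The names decode_state and find_key and the sample payload are made up for illustration; the real payload is much larger and its exact layout is not shown here. The escapes are undone in a single regex pass, so a replaced character can never be re-read as part of another escape.

```python
import json
import re

# Same five escapes as in the answer above, applied in one pass.
_ESCAPES = {'&q;': '"', '&l;': '<', '&g;': '>', '&a;': '&', '&s;': "'"}
_ESCAPE_RE = re.compile('|'.join(re.escape(k) for k in _ESCAPES))


def decode_state(raw):
    """Undo the custom escapes, then parse the result as JSON."""
    return json.loads(_ESCAPE_RE.sub(lambda m: _ESCAPES[m.group(0)], raw))


def find_key(data, key):
    """Yield every value stored under `key`, at any nesting depth."""
    if isinstance(data, dict):
        for k, v in data.items():
            if k == key:
                yield v
            else:
                yield from find_key(v, key)
    elif isinstance(data, list):
        for v in data:
            yield from find_key(v, key)


# Hypothetical payload, invented for this example only.
sample = ('{&q;page&q;:[{&q;newsArticle&q;:'
          '{&q;value&q;:&q;&l;p&g;Tom &a; Co&s;s results&l;/p&g;&q;}}]}')

# next(..., None) avoids a StopIteration if the key is missing.
article = next(find_key(decode_state(sample), 'newsArticle'), None)
if article is None:
    print('no newsArticle key found')
else:
    print(article['value'])
```

Passing a default to next() is worth copying into the real script too, since the page structure may change and the key may no longer be there.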

