How to scrape news content and remove the irrelevant parts

Question

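My goal is to scrape the text of 100 news articles with BeautifulSoup and a for loop, and store the text in the list myarticle. I want myarticle to contain only the content of the news articles, which, as far as I can tell, all have the h attribute. However, the result I get contains many irrelevant parts, such as "Thanks for contacting us. We've received your submission." and "This story has been shared 205,105 times. 205,105", and so on.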

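Another problem is that when I call print(myarticle[0]), it prints many news articles, but I expect it to print just one.

I would like to know how to remove the irrelevant parts and keep only the main content that a reader sees on the news website, and how to adjust the code so that print(myarticle[0]) gives me only the first news article.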

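One of the 100 news articles is on this page: https://nypost.com/2020/04/21/missouri-sues-china-over-coronavirus-deceit/

The other news articles I want to scrape are listed on this site: https://nypost.com/search/China+COVID-19/page/1/?orderby=relevance

Below are the lines of code relevant to my question.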

            for pagelink in pagelinks:
                #get page text
                page = requests.get(pagelink)
                #parse with BeautifulSoup
                soup = bs(page.text, 'lxml')
                articletext = soup.find_all('p')
                for paragraph in articletext[:-1]:
                    #get the text only
                    text = paragraph.get_text()
                    paragraphtext.append(text)

                #combine all paragraphs into an article
                thearticle.append(paragraphtext)
    # join paragraphs to re-create the article            
    myarticle = [''.join(article) for article in thearticle]
    #show the first string of the list
    print(myarticle[0])

Answer 1

soup.find_all('p')

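This call finds every p element on the page. p is a very common tag used for most text, which is why you also get text that is not part of the article.

I would first find the div that contains only the article, and then get the text from it, for example: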

container = soup.find("div", class_=['entry-content', 'entry-content-read-more'])
articletext = container.find_all('p')
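
The snippet above only covers the first problem. A minimal sketch of both fixes together follows: restrict the search to the article container, and build a fresh paragraph list for every page so that each entry of myarticle holds exactly one article. The names extract_article, collect_articles, ARTICLE_CLASSES and sample_pages are hypothetical, the sample HTML is made up for illustration, and the class names are simply the ones used above, so whether they still match the live site is an assumption. The parser is 'html.parser' here only to avoid the extra lxml dependency. In the question's excerpt the initialization of paragraphtext is not shown; if it is created once outside the loop, every page appends to the same list, which would explain why myarticle[0] contains many articles.

```python
from bs4 import BeautifulSoup

# Class names taken from the answer above; treat them as an assumption
# about the page layout, not a guarantee.
ARTICLE_CLASSES = ["entry-content", "entry-content-read-more"]


def extract_article(html):
    """Return the article body of one page as a single string ('' if not found)."""
    soup = BeautifulSoup(html, "html.parser")
    # Passing a list to class_ matches a div that has any of these classes.
    container = soup.find("div", class_=ARTICLE_CLASSES)
    if container is None:
        return ""
    # A new list is built on every call, so paragraphs never leak between pages.
    paragraphs = [p.get_text(strip=True) for p in container.find_all("p")]
    return " ".join(text for text in paragraphs if text)


def collect_articles(pages):
    """Build one string per page of HTML."""
    return [extract_article(html) for html in pages]


# Made-up pages standing in for the downloaded HTML (page.text in the question).
sample_pages = [
    """<html><body>
    <p>Thanks for contacting us.</p>
    <div class="entry-content">
      <p>First article, paragraph one.</p>
      <p>Paragraph two.</p>
    </div>
    <p>This story has been shared many times.</p>
    </body></html>""",
    """<html><body>
    <div class="entry-content"><p>Second article.</p></div>
    </body></html>""",
]

myarticle = collect_articles(sample_pages)
print(myarticle[0])
```

With real pages, the list passed to collect_articles would be the page.text of each requests.get(pagelink) response. Unlike the question's loop, this sketch does not drop the last paragraph (articletext[:-1]), since the container should already exclude the trailing boilerplate.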

Answer 1: 1 accepted, 2020-05-19 09:06:52
