
How to scrape news content and remove the irrelevant parts

I want to scrape the text of 100 news articles using BeautifulSoup and a for loop, and store the texts in the list myarticle. I expect myarticle to contain only the content of the news articles, which I find is all inside <p> tags. However, the result contains many irrelevant parts, such as "Thanks for contacting us. We've received your submission." and "This story has been shared 205,105 times. 205,105".

Another issue is that print(myarticle[0]) gives me many news articles, whereas I expect it to give me only one.

I would like to know how to remove the irrelevant parts and keep only the main content, as it reads on the news website, and how to adjust the code so that print(myarticle[0]) gives me just the first news article.

One of the 100 news articles is on this page: https://nypost.com/2020/04/21/missouri-sues-china-over-coronavirus-deceit/

Other news articles I want to scrape are on this site: https://nypost.com/search/China+COVID-19/page/1/?orderby=relevance

Below are the lines of code relevant to my question.

            for pagelink in pagelinks:
                #get page text
                page = requests.get(pagelink)
                #parse with BeautifulSoup
                soup = bs(page.text, 'lxml')
                articletext = soup.find_all('p')
                for paragraph in articletext[:-1]:
                    #get the text only
                    text = paragraph.get_text()
                    paragraphtext.append(text)

                #combine all paragraphs into an article
                thearticle.append(paragraphtext)
    # join paragraphs to re-create the article            
    myarticle = [''.join(article) for article in thearticle]
    #show the first string of the list
    print(myarticle[0])
soup.find_all('p')

Here you find every <p> element on the page. <p> is a very common tag used for most text, not just the article body, which is why you also get non-article text.

I would first find the containing div for just the article and then get the text, something like:

container = soup.find("div", class_=['entry-content', 'entry-content-read-more'])
articletext = container.find_all('p')
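
The second problem (print(myarticle[0]) showing many articles) is not covered by the snippet above. Judging from the excerpt in the question, paragraphtext appears to be created once outside the loop and never reset, so every page's paragraphs pile up in the same list. A minimal sketch that builds a fresh list per page and limits the search to the article container might look like this. The helper name extract_article and the sample HTML are hypothetical, the class names are the ones suggested above and may not match the live site, and html.parser is used here only to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup


def extract_article(html):
    """Return the article body of one page as a single string."""
    soup = BeautifulSoup(html, "html.parser")
    # Class names come from the answer above; they are an assumption
    # about the page markup and may need adjusting.
    container = soup.find("div", class_=["entry-content", "entry-content-read-more"])
    if container is None:
        # No article container on this page: return an empty article.
        return ""
    # A new list is built on every call, so paragraphs from
    # different pages can never mix.
    paragraphs = [p.get_text(strip=True) for p in container.find_all("p")]
    return "\n".join(p for p in paragraphs if p)


# Made-up pages standing in for the downloaded HTML (page.text in the question).
sample_pages = [
    '<html><body><p>Thanks for contacting us.</p>'
    '<div class="entry-content"><p>First paragraph.</p><p>Second paragraph.</p></div>'
    '<p>This story has been shared many times.</p></body></html>',
    '<div class="entry-content entry-content-read-more"><p>Another article.</p></div>',
]

# One string per page, so myarticle[0] is only the first article.
myarticle = [extract_article(html) for html in sample_pages]
print(myarticle[0])
```

In the original loop, the same idea would be to call extract_article(page.text) for each pagelink, or at least to put paragraphtext = [] as the first statement inside the for pagelink loop.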

