I am trying to scrape the text of 100 news articles using BeautifulSoup and a for loop, and store the texts in the list myarticle . I expect myarticle to contain only the content of the news articles, which I found is all inside <p> tags. However, the result contains many irrelevant parts, such as "Thanks for contacting us. We've received your submission." and "This story has been shared 205,105 times. 205,105".
Another issue is that print(myarticle[0]) gives me many news articles, whereas I expect it to give me only one.
I would like to know how to remove the irrelevant parts and keep only the main content as it appears on the news site, and how to adjust the code so that print(myarticle[0]) gives me only the first news article.
One of the 100 news articles is on this page: https://nypost.com/2020/04/21/missouri-sues-china-over-coronavirus-deceit/
Other news articles I want to scrape are on this site: https://nypost.com/search/China+COVID-19/page/1/?orderby=relevance
Below are the lines of code relevant to my question.
for pagelink in pagelinks:
    #get page text
    page = requests.get(pagelink)
    #parse with BeautifulSoup
    soup = bs(page.text, 'lxml')
    articletext = soup.find_all('p')
    for paragraph in articletext[:-1]:
        #get the text only
        text = paragraph.get_text()
        paragraphtext.append(text)
    #combine all paragraphs into an article
    thearticle.append(paragraphtext)
# join paragraphs to re-create the article
myarticle = [''.join(article) for article in thearticle]
#show the first string of the list
print(myarticle[0])
soup.find_all('p')
Here you are finding every <p> element on the page. <p> is a very common tag used for most text, which is why you also get non-article text.
I would first find the div that contains just the article and then get the text from it, something like:
container = soup.find("div", class_=['entry-content', 'entry-content-read-more'])
articletext = container.find_all('p')
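A minimal sketch of how this could fit into the whole loop, which may also explain why print(myarticle[0]) shows many articles: if paragraphtext is created once before the loop (an assumption, since its initialisation is not shown in the question), every page appends to the same list. Building a fresh list per page avoids that. The function names extract_article and extract_articles and the sample HTML are hypothetical, the class names are the ones suggested above and may not match the live page markup, and html.parser is used here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup


def extract_article(html):
    """Return the article body of one page as a single string."""
    soup = BeautifulSoup(html, "html.parser")
    # Class names come from the suggestion above; they are assumptions
    # about the site's markup and may need adjusting.
    container = soup.find("div", class_=["entry-content", "entry-content-read-more"])
    if container is None:
        # No article container found on this page.
        return ""
    # A new list is built for every page, so paragraphs from
    # different articles are never mixed together.
    paragraphs = [p.get_text(strip=True) for p in container.find_all("p")]
    return "\n".join(p for p in paragraphs if p)


def extract_articles(pages):
    """Return one string per page, in the same order as the input."""
    return [extract_article(html) for html in pages]


# Made-up HTML standing in for two downloaded pages.
sample_pages = [
    '<div class="entry-content"><p>First article, paragraph one.</p>'
    '<p>Paragraph two.</p></div><p>Thanks for contacting us.</p>',
    '<div class="entry-content"><p>Second article.</p></div>',
]

myarticle = extract_articles(sample_pages)
print(myarticle[0])
```

With real pages, the HTML for each link would come from requests.get(pagelink).text, as in the question, instead of sample_pages.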