
Python (BeautifulSoup): How to limit text extracted from an HTML news page to the news article itself

I wrote this test code which uses BeautifulSoup.

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
for n in soup.find_all('p'):
    print(n.get_text())

It works fine, but it also retrieves text that is not part of the news article, such as the time it was posted, the number of comments, the copyright notice, etc.

I would like it to retrieve only the text of the news article itself. How would one go about this?

You'll need to target something more specific than just the p tag. Try looking for a div with class="article" or something similar, then grab paragraphs only from there.

You might have much better luck with the newspaper library, which is focused on scraping articles.

If we are talking about BeautifulSoup only, one option to get closer to the desired result and extract more relevant paragraphs is to find them in the context of the div element with the itemprop="articleBody" attribute:

article_body = soup.find(itemprop="articleBody")
for p in article_body.find_all("p"):
    print(p.get_text())
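The same approach can be demonstrated on a small self-contained snippet of HTML (a made-up stand-in for the Daily Mail markup), which also shows why it is worth guarding against find() returning None when the page lacks the attribute:

```python
from bs4 import BeautifulSoup

# Minimal, hypothetical HTML standing in for a real news page
html = """
<html><body>
  <p>Posted 12:00 | 5 comments</p>
  <div itemprop="articleBody">
    <p>First paragraph of the story.</p>
    <p>Second paragraph of the story.</p>
  </div>
  <p>Copyright notice</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
article_body = soup.find(itemprop="articleBody")
if article_body is not None:  # the attribute may be missing on other pages
    for p in article_body.find_all("p"):
        print(p.get_text())
```

Only the two story paragraphs are printed; the timestamp and copyright lines outside the articleBody div are skipped.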

Be more specific: you need to catch the div with the itemprop="articleBody" attribute, so:

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
for n in soup.find_all('div', attrs={'itemprop': "articleBody"}):
    print(n.get_text())

Answers on SO are not just for you, but also for people coming from Google searches and the like. As you can see, attrs is a dict, so it is possible to pass more attributes/values if needed.
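To illustrate that last point, here is a small sketch (on made-up HTML, not the real Daily Mail page) that combines two attribute/value pairs in the attrs dict to narrow the match further:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one div matches both attributes, one matches neither
html = """
<html><body>
  <div class="comments"><p>3 comments</p></div>
  <div class="article" itemprop="articleBody"><p>Story text.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# attrs is a dict, so several attribute/value pairs can be combined;
# only elements matching all of them are returned
for div in soup.find_all("div", attrs={"class": "article", "itemprop": "articleBody"}):
    print(div.get_text(strip=True))
```

The div with class="comments" is filtered out because it matches neither attribute pair.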
