
Python (BeautifulSoup): How to limit text extracted from an HTML news page to the news article itself

I wrote this test code which uses BeautifulSoup.

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
for n in soup.find_all('p'):
    print(n.get_text())

It works fine, but it also retrieves text that is not part of the news article, such as the time it was posted, the number of comments, the copyright notice, etc.

I would like it to retrieve only the text of the news article itself. How would one go about this?

You'll need to target something more specific than just the p tag. Try looking for a div with class="article" or something similar, then grab paragraphs only from there.

You might have much better luck with the newspaper library, which is focused on scraping articles.

If we are talking about BeautifulSoup only, one option to get closer to the desired result and extract more relevant paragraphs is to find them in the context of the div element with the itemprop="articleBody" attribute:

article_body = soup.find(itemprop="articleBody")
for p in article_body.find_all("p"):
    print(p.get_text())
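The same approach can be demonstrated on a small self-contained snippet of HTML (a made-up stand-in for the Daily Mail markup), which also shows why it is worth guarding against find() returning None when the page lacks the attribute:

```python
from bs4 import BeautifulSoup

# Minimal, hypothetical HTML standing in for a real news page
html = """
<html><body>
  <p>Posted 12:00 | 5 comments</p>
  <div itemprop="articleBody">
    <p>First paragraph of the story.</p>
    <p>Second paragraph of the story.</p>
  </div>
  <p>Copyright notice</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
article_body = soup.find(itemprop="articleBody")
if article_body is not None:  # the attribute may be missing on other pages
    for p in article_body.find_all("p"):
        print(p.get_text())
```

Only the two story paragraphs are printed; the timestamp and copyright lines outside the articleBody div are skipped.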

Be more specific: you need to catch the div with the itemprop="articleBody" attribute, so:

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
for n in soup.find_all('div', attrs={'itemprop': "articleBody"}):
    print(n.get_text())

Answers on SO are not just for you, but also for people coming from Google searches and the like. As you can see, attrs is a dict, so it is possible to pass more attributes/values if needed.
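To illustrate that last point, here is a small sketch (on made-up HTML, not the real Daily Mail page) that combines two attribute/value pairs in the attrs dict to narrow the match further:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one div matches both attributes, one matches neither
html = """
<html><body>
  <div class="comments"><p>3 comments</p></div>
  <div class="article" itemprop="articleBody"><p>Story text.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# attrs is a dict, so several attribute/value pairs can be combined;
# only elements matching all of them are returned
for div in soup.find_all("div", attrs={"class": "article", "itemprop": "articleBody"}):
    print(div.get_text(strip=True))
```

The div with class="comments" is filtered out because it matches neither attribute pair.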
