
Web scraping news articles in some cases returns an empty body

I just wanted to scrape a few articles from the El Pais website archive. From each article I take the title, hashtags, and article body. The HTML structure of each article is the same, and the script succeeds with all the titles and hashtags; however, for some of the articles it does not scrape the body at all. Below I add my code, links to fully working articles, and a few links to ones returning empty bodies. Do you know how to fix it? The empty-body articles do not follow a pattern: sometimes there are 3 empty articles in a row, then 5 successful ones, then 1 empty, then 3 successful.

Working articles:

article1: https://elpais.com/diario/1990/01/17/economia/632530813_850215.html
article2: https://elpais.com/diario/1990/01/07/internacional/631666806_850215.html
article3: https://elpais.com/diario/1990/01/05/deportes/631494011_850215.html

Articles without the body:

article4: https://elpais.com/diario/1990/01/23/madrid/633097458_850215.html
article5: https://elpais.com/diario/1990/01/30/economia/633654016_850215.html
article6: https://elpais.com/diario/1990/01/03/espana/631321213_850215.html

    from bs4 import BeautifulSoup
    import requests

    # URL of the article to be scraped (use one of the links above)
    URL = "some_url_of_article_above"
    #print(URL)
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, "html.parser")

    # article body paragraphs
    bodydiv = soup.find("div", id="ctn_article_body")
    artbody = bodydiv.find_all("p", class_="")

    # hashtag list items
    tagdiv = soup.find("div", id="mod_archivado")
    hashtags = tagdiv.find_all("li", class_="w_i | capitalize flex align_items_center")

    # article title
    titlediv = soup.find("div", id="article_header")
    title = titlediv.find("h1")
    print(title.text)

    # print body of the article
    arttext = ""
    for par in artbody:
        arttext += str(par.text)
    print(arttext)

    # hashtags, comma-separated
    tagstring = ""
    for hashtag in hashtags:
        tagstring += hashtag.text
        tagstring += ","
    print(tagstring)

Thank you in advance for your help!

The problem is that inside the <div class="a_b article_body | color_gray_dark" id="ctn_article_body"> element there is a broken, unpaired <b> tag. Take a look at this snippet from the HTML page:

<div id="ctn_article_body" class="a_b article_body | color_gray_dark"><p class=""></b>La Asociación Ecologista de Defensa dela Naturaleza (AEDENAT) ha denunciado que las obras de la carretera que cruza la Universidad Complutense desde la carretera de La Coruña hasta Cuatro Caminos se están realizando "sin permisos de ningún tipo" y suponen "la destrucción de zonas de pinar en las cercanías del edificio de Filosofia B".</p>

Just after the first opening <p> tag, there is an </b> without a matching opening <b> tag. That is why "html.parser" is failing.

Using this text,

from bs4 import BeautifulSoup

text = """<div id="ctn_article_body" class="a_b article_body | color_gray_dark"><p class=""></b>La Asociación Ecologista de Defensa de la Naturaleza (AEDENAT) ha denunciado que las obras de la carretera que cruza la Universidad Complutense desde la carretera de La Coruña hasta Cuatro Caminos se están realizando "sin permisos de ningún tipo" y suponen "la destrucción de zonas de pinar en las cercanías del edificio de Filosofia B".</p><div id="elpais_gpt-INTEXT" style="width: 0px; height: 0px; display: none;"></div><p class="">Por su parte, José Luis Garro, tercer teniente de alcalde, ha declarado a EL PAÍS: "Tenemos una autorización provisional del rector de la Universidad Complutense. Toda esa zona, además, está pendiente de un plan especial de reforma interior (PERI). Ésta es sólo una solución provisional".</p><p class="">Según Garro, el trazado de la carretera "ha tenido que dar varias vueltas para no tocar las masas arbóreas", aunque reconoce que se ha hecho "en algunos casos", si bien causando "un daño mínimo".</p><p class="footnote">* Este artículo apareció en la edición impresa del lunes, 22 de enero de 1990.</p></div>"""

soup = BeautifulSoup(text, "html.parser")
print(soup.find("div"))

Output:

<div class="a_b article_body | color_gray_dark" id="ctn_article_body"><p class=""></p></div>

How to solve this? I tried again with a different parser, "lxml" instead of "html.parser", and it works.

It selects the div correctly, so just changing this line should work:

soup = BeautifulSoup(text, "lxml")

Of course, you will need to have this parser installed (pip install lxml).
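
To make the parser difference explicit, here is a minimal sketch. The two-paragraph test string is a shortened stand-in I made up, but it reproduces the same flaw as the real page: a stray </b> right after the first opening <p> tag.

from bs4 import BeautifulSoup

# shortened stand-in with the same flaw as the article page:
# a stray </b> right after the first opening <p> tag
text = ('<div id="ctn_article_body"><p class=""></b>Primer parrafo.</p>'
        '<p class="">Segundo parrafo.</p></div>')

for parser in ("html.parser", "lxml"):
    soup = BeautifulSoup(text, parser)
    div = soup.find("div", id="ctn_article_body")
    paragraphs = div.find_all("p") if div is not None else []
    body = "".join(p.get_text() for p in paragraphs)
    print(f"{parser}: {len(paragraphs)} <p> tag(s), body text: {body!r}")

As the output above showed, html.parser closes the div at the stray </b> and loses the body text, while lxml repairs the markup and keeps every paragraph.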

EDIT:

As @moreni123 commented below, this solution seems to be correct for certain cases but not for all. Given that, I will add another option that could also work.

It seems that it would be better to use Selenium to fetch the webpage, given that some of the content is generated with JavaScript; requests cannot execute it, as that is not its purpose.

I'm going to use Selenium with a headless Chrome driver:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

# article to fetch
url = "https://elpais.com/diario/1990/01/14/madrid/632319855_850215.html"

driver_options = Options()
driver_options.add_argument("--headless")
driver = webdriver.Chrome(executable_path="path/to/chrome/driver", options=driver_options)

# fetch the page and grab the source code with the JS already executed
driver.get(url)
page = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")

# as before, we use BeautifulSoup to parse it; Selenium is a powerful
# tool, and you could use it for the extraction as well
soup = BeautifulSoup(page, "html.parser")
print(soup.select("#ctn_article_body"))

# quitting the driver
if driver is not None:
    driver.quit()

Make sure that the path to the Chrome driver is correct in this line:

driver = webdriver.Chrome(executable_path="path/to/chrome/driver", options=driver_options)
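
Note that executable_path belongs to the Selenium 3 API; in Selenium 4 it is deprecated (and removed in 4.10+) in favor of a Service object. A minimal sketch of the equivalent call, assuming the same driver path:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

driver_options = Options()
driver_options.add_argument("--headless")

# Selenium 4 style: wrap the driver path in a Service object
driver = webdriver.Chrome(service=Service("path/to/chrome/driver"),
                          options=driver_options)

Since Selenium 4.6 you can even drop the path entirely (webdriver.Chrome(options=driver_options)) and Selenium Manager will download a matching driver for you.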

Here are links to the Selenium documentation (https://www.selenium.dev/documentation/) and to ChromeDriver (https://chromedriver.chromium.org/), in case you need to download it.

This solution should work; at least it works on the article you passed me.
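
If you want to combine both approaches into one scraper, a reasonable pattern is to try the cheap requests + lxml path first and fall back to Selenium only when the body comes back empty. A rough sketch of that idea (fetch_body_requests and fetch_body_selenium are hypothetical helper names, not part of either library):

from bs4 import BeautifulSoup
import requests

def fetch_body_requests(url):
    # hypothetical helper: cheap path, requests + the lxml parser
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "lxml")
    bodydiv = soup.find("div", id="ctn_article_body")
    if bodydiv is None:
        return ""
    return "".join(p.get_text() for p in bodydiv.find_all("p", class_=""))

def fetch_body_selenium(url, driver):
    # hypothetical helper: fallback path, let the browser run the JS first
    driver.get(url)
    page = driver.execute_script(
        "return document.getElementsByTagName('html')[0].innerHTML")
    soup = BeautifulSoup(page, "lxml")
    bodydiv = soup.find("div", id="ctn_article_body")
    if bodydiv is None:
        return ""
    return "".join(p.get_text() for p in bodydiv.find_all("p", class_=""))

# reusing the url and driver set up above: try requests first and only
# fall back to the browser for the stubborn articles
body = fetch_body_requests(url)
if not body:
    body = fetch_body_selenium(url, driver)

This keeps Selenium, which is much slower, out of the loop for the majority of articles that parse fine.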
