简体   繁体   中英

Web scraping with Newspaper3k, got only 50 articles

I want to scrape data in a french website with newspaper3k and the result will be only 50 articles. This website has much more than 50 articles. Where am I wrong ?

My goal is to scrape all the articles in this website.

I tried this:

import newspaper

legorafi_paper = newspaper.build('http://www.legorafi.fr/', memoize_articles=False)

# Empty list to put all urls
papers = []

for article in legorafi_paper.articles:
    papers.append(article.url)

print(legorafi_paper.size())

The result of this print is 50 articles.

I don't understand why newspaper3k will only scrape 50 articles and not much more.

UPDATE OF WHAT I TRIED:

def Foo(firstTime = []):
    if firstTime == []:
        WebDriverWait(driver, 30).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"div#appconsent>iframe")))
        firstTime.append('Not Empty')
    else:
        print('Cookies already accepted')


%%time


categories = ['societe', 'politique']


import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains

import newspaper
import requests
from newspaper.utils import BeautifulSoup
from newspaper import Article

categories = ['people', 'sports']
papers = []


driver = webdriver.Chrome(executable_path="/Users/name/Downloads/chromedriver 4")
driver.get('http://www.legorafi.fr/')


for category in categories:
    url = 'http://www.legorafi.fr/category/' + category
    #WebDriverWait(self.driver, 10)
    driver.get(url)
    Foo()
    WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.button--filled>span.baseText"))).click()

    pagesToGet = 2

    title = []
    content = []
    for page in range(1, pagesToGet+1):
        print('Processing page :', page)
        #url = 'http://www.legorafi.fr/category/france/politique/page/'+str(page)
        print(driver.current_url)
        #print(url)

        time.sleep(3)

        raw_html = requests.get(url)
        soup = BeautifulSoup(raw_html.text, 'html.parser')
        for articles_tags in soup.findAll('div', {'class': 'articles'}):
            for article_href in articles_tags.find_all('a', href=True):
                if not str(article_href['href']).endswith('#commentaires'):
                    urls_set.add(article_href['href'])
                    papers.append(article_href['href'])


        for url in papers:
            article = Article(url)
            article.download()
            article.parse()
            if article.title not in title:
                title.append(article.title)
            if article.text not in content:
                content.append(article.text)
            #print(article.title,article.text)

        time.sleep(5)
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        driver.find_element_by_xpath("//a[contains(text(),'Suivant')]").click()
        time.sleep(10)

UPDATE 09-21-2020

I rechecked your code and it is working correctly, because it is extracting all the articles on the main page of Le Gorafi . The articles on this page are highlights from the category pages, such as societe, sports, etc.

The example below is from the main page's source code. Each of these articles is also listed on the category page sports.

<div class="cat sports">
    <a href="http://www.legorafi.fr/category/sports/">
       <h4>Sports</h4>
          <ul>
              <li>
                 <a href="http://www.legorafi.fr/2020/07/24/chaque-annee-25-des-lutteurs-doivent-etre-operes-pour-defaire-les-noeuds-avec-leur-bras/" title="Voir l'article 'Chaque année, 25% des lutteurs doivent être opérés pour défaire les nœuds avec leur bras'">
                  Chaque année, 25% des lutteurs doivent être opérés pour défaire les nœuds avec leur bras</a>
              </li>
               <li>
                <a href="http://www.legorafi.fr/2020/07/09/frank-mccourt-lom-nest-pas-a-vendre-sauf-contre-beaucoup-dargent/" title="Voir l'article 'Frank McCourt « L'OM n'est pas à vendre sauf contre beaucoup d'argent »'">
                  Frank McCourt « L'OM n'est pas à vendre sauf contre beaucoup d'argent </a>
              </li>
              <li>
                <a href="http://www.legorafi.fr/2020/06/10/euphorique-un-parieur-appelle-son-fils-betclic/" title="Voir l'article 'Euphorique, un parieur appelle son fils Betclic'">
                  Euphorique, un parieur appelle son fils Betclic                 </a>
              </li>
           </ul>
               <img src="http://www.legorafi.fr/wp-content/uploads/2015/08/rubrique_sport1-300x165.jpg"></a>
        </div>
              </div>

It seems that there are 35 unique article entries on the main page.

import newspaper

legorafi_paper = newspaper.build('http://www.legorafi.fr', memoize_articles=False)

papers = []
urls_set = set()
for article in legorafi_paper.articles:
   # check to see if the article url is not within the urls_set
   if article.url not in urls_set:
     # add the unique article url to the set
     urls_set.add(article.url)
     # remove all links for article commentaires
     if not str(article.url).endswith('#commentaires'):
        papers.append(article.url)

 print(len(papers)) 
 # output
 35

If I change the URL in the code above to this: http://www.legorafi.fr/category/sports , it returns the same number of articles as http://www.legorafi.fr . After looking at the source code for Newspaper on GitHub , it seems that the module is using urlparse , which seems to be using the netloc segment of urlparse . The netloc is www.legorafi.fr . I noted that this is a known problem with Newspaper based on this open issue.

To obtain all the articles it becomes more complex, because you have to use some additional modules, including requests and BeautifulSoup . The latter can be called from Newspaper . The code below can be refined to obtain all the articles within the source code on the main page and category pages using requests and BeautifulSoup .

import newspaper
import requests
from newspaper.utils import BeautifulSoup

papers = []
urls_set = set()

legorafi_paper = newspaper.build('http://www.legorafi.fr', 
fetch_images=False, memoize_articles=False)
for article in legorafi_paper.articles:
   if article.url not in urls_set:
     urls_set.add(article.url)
     if not str(article.url).endswith('#commentaires'):
       papers.append(article.url)

 
legorafi_urls = {'monde-libre': 'http://www.legorafi.fr/category/monde-libre',
             'politique': 'http://www.legorafi.fr/category/france/politique',
             'societe': 'http://www.legorafi.fr/category/france/societe',
             'economie': 'http://www.legorafi.fr/category/france/economie',
             'culture': 'http://www.legorafi.fr/category/culture',
             'people': 'http://www.legorafi.fr/category/people',
             'sports': 'http://www.legorafi.fr/category/sports',
             'hi-tech': 'http://www.legorafi.fr/category/hi-tech',
             'sciences': 'http://www.legorafi.fr/category/sciences',
             'ledito': 'http://www.legorafi.fr/category/ledito/'
             }


for category, url in legorafi_urls.items():
   raw_html = requests.get(url)
   soup = BeautifulSoup(raw_html.text, 'html.parser')
   for articles_tags in soup.findAll('div', {'class': 'articles'}):
      for article_href in articles_tags.find_all('a', href=True):
         if not str(article_href['href']).endswith('#commentaires'):
           urls_set.add(article_href['href'])
           papers.append(article_href['href'])

   print(len(papers))
   # output
   155

If you need to obtain the articles listed in the subpages of a category page (politique currently has 120 subpages) then you would have to use something like Selenium to click the links.

Hopefully, this code helps you get closer to achieving your objective.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM