
Web scraping with Newspaper3k, got only 50 articles

I want to scrape data from a French website with newspaper3k, but I get only 50 articles. This website has many more than 50 articles. Where am I going wrong?

My goal is to scrape all the articles on this website.

I tried this:

import newspaper

legorafi_paper = newspaper.build('http://www.legorafi.fr/', memoize_articles=False)

# Empty list to put all urls
papers = []

for article in legorafi_paper.articles:
    papers.append(article.url)

print(legorafi_paper.size())

The result of this print is 50 articles.

I don't understand why newspaper3k only scrapes 50 articles and not more.

UPDATE OF WHAT I TRIED:

def Foo(firstTime=[]):
    # Switch into the cookie-consent iframe only on the first call;
    # the mutable default argument remembers that it has already run.
    if firstTime == []:
        WebDriverWait(driver, 30).until(
            EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "div#appconsent>iframe")))
        firstTime.append('Not Empty')
    else:
        print('Cookies already accepted')


%%time


import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains

import newspaper
import requests
from newspaper.utils import BeautifulSoup
from newspaper import Article

categories = ['people', 'sports']
papers = []
urls_set = set()  # fix: this set is used below but was never defined


driver = webdriver.Chrome(executable_path="/Users/name/Downloads/chromedriver 4")
driver.get('http://www.legorafi.fr/')


for category in categories:
    url = 'http://www.legorafi.fr/category/' + category
    #WebDriverWait(self.driver, 10)
    driver.get(url)
    Foo()
    WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.button--filled>span.baseText"))).click()

    pagesToGet = 2

    title = []
    content = []
    for page in range(1, pagesToGet+1):
        print('Processing page :', page)
        #url = 'http://www.legorafi.fr/category/france/politique/page/'+str(page)
        print(driver.current_url)
        #print(url)

        time.sleep(3)

        # collect the article links listed on the category page
        raw_html = requests.get(url)
        soup = BeautifulSoup(raw_html.text, 'html.parser')
        for articles_tags in soup.findAll('div', {'class': 'articles'}):
            for article_href in articles_tags.find_all('a', href=True):
                if not str(article_href['href']).endswith('#commentaires'):
                    urls_set.add(article_href['href'])
                    papers.append(article_href['href'])


        # download and parse each collected article; use a separate name so
        # the category-page url from the outer loop is not overwritten
        for paper_url in papers:
            article = Article(paper_url)
            article.download()
            article.parse()
            if article.title not in title:
                title.append(article.title)
            if article.text not in content:
                content.append(article.text)
            #print(article.title,article.text)

        time.sleep(5)
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        driver.find_element_by_xpath("//a[contains(text(),'Suivant')]").click()
        time.sleep(10)

UPDATE 09-21-2020

I rechecked your code and it is working correctly, because it is extracting all the articles on the main page of Le Gorafi. The articles on this page are highlights from the category pages, such as societe, sports, etc.

The example below is from the main page's source code. Each of these articles is also listed on the sports category page.

<div class="cat sports">
    <a href="http://www.legorafi.fr/category/sports/">
       <h4>Sports</h4>
          <ul>
              <li>
                 <a href="http://www.legorafi.fr/2020/07/24/chaque-annee-25-des-lutteurs-doivent-etre-operes-pour-defaire-les-noeuds-avec-leur-bras/" title="Voir l'article 'Chaque année, 25% des lutteurs doivent être opérés pour défaire les nœuds avec leur bras'">
                  Chaque année, 25% des lutteurs doivent être opérés pour défaire les nœuds avec leur bras</a>
              </li>
               <li>
                <a href="http://www.legorafi.fr/2020/07/09/frank-mccourt-lom-nest-pas-a-vendre-sauf-contre-beaucoup-dargent/" title="Voir l'article 'Frank McCourt « L'OM n'est pas à vendre sauf contre beaucoup d'argent »'">
                  Frank McCourt « L'OM n'est pas à vendre sauf contre beaucoup d'argent </a>
              </li>
              <li>
                <a href="http://www.legorafi.fr/2020/06/10/euphorique-un-parieur-appelle-son-fils-betclic/" title="Voir l'article 'Euphorique, un parieur appelle son fils Betclic'">
                  Euphorique, un parieur appelle son fils Betclic                 </a>
              </li>
           </ul>
    <img src="http://www.legorafi.fr/wp-content/uploads/2015/08/rubrique_sport1-300x165.jpg"></a>
</div>

It seems that there are 35 unique article entries on the main page.

import newspaper

legorafi_paper = newspaper.build('http://www.legorafi.fr', memoize_articles=False)

papers = []
urls_set = set()
for article in legorafi_paper.articles:
    # check to see if the article url is not already within urls_set
    if article.url not in urls_set:
        # add the unique article url to the set
        urls_set.add(article.url)
        # skip the links to the article comment sections
        if not str(article.url).endswith('#commentaires'):
            papers.append(article.url)

print(len(papers))
# output
# 35

If I change the URL in the code above to http://www.legorafi.fr/category/sports, it returns the same number of articles as http://www.legorafi.fr. After looking at the source code for Newspaper on GitHub, it seems that the module uses urlparse and relies on the netloc segment of the parsed URL. The netloc here is www.legorafi.fr. I noted that this is a known problem with Newspaper, based on this open issue.
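
As a quick illustration of that point, here is a minimal sketch using only Python's standard library urllib.parse (not Newspaper itself): both the main page and a category page share the same netloc, which matches the behaviour described above.

from urllib.parse import urlparse

# both URLs have the same netloc, so Newspaper treats a category URL
# like the main page when it builds the source
print(urlparse('http://www.legorafi.fr').netloc)                  # www.legorafi.fr
print(urlparse('http://www.legorafi.fr/category/sports').netloc)  # www.legorafi.fr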

Obtaining all the articles is more complex, because you have to use some additional modules, including requests and BeautifulSoup. The latter can be called from Newspaper. The code below can be refined to obtain all the articles within the source code of the main page and the category pages using requests and BeautifulSoup.

import newspaper
import requests
from newspaper.utils import BeautifulSoup

papers = []
urls_set = set()

legorafi_paper = newspaper.build('http://www.legorafi.fr',
                                 fetch_images=False, memoize_articles=False)

for article in legorafi_paper.articles:
    if article.url not in urls_set:
        urls_set.add(article.url)
        if not str(article.url).endswith('#commentaires'):
            papers.append(article.url)

 
legorafi_urls = {'monde-libre': 'http://www.legorafi.fr/category/monde-libre',
                 'politique': 'http://www.legorafi.fr/category/france/politique',
                 'societe': 'http://www.legorafi.fr/category/france/societe',
                 'economie': 'http://www.legorafi.fr/category/france/economie',
                 'culture': 'http://www.legorafi.fr/category/culture',
                 'people': 'http://www.legorafi.fr/category/people',
                 'sports': 'http://www.legorafi.fr/category/sports',
                 'hi-tech': 'http://www.legorafi.fr/category/hi-tech',
                 'sciences': 'http://www.legorafi.fr/category/sciences',
                 'ledito': 'http://www.legorafi.fr/category/ledito/'
                 }


for category, url in legorafi_urls.items():
    raw_html = requests.get(url)
    soup = BeautifulSoup(raw_html.text, 'html.parser')
    for articles_tags in soup.findAll('div', {'class': 'articles'}):
        for article_href in articles_tags.find_all('a', href=True):
            if not str(article_href['href']).endswith('#commentaires'):
                urls_set.add(article_href['href'])
                papers.append(article_href['href'])

print(len(papers))
# output
# 155

If you need to obtain the articles listed in the subpages of a category page (politique currently has 120 subpages), then you would have to use something like Selenium to click through the links.
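
As a possible alternative to clicking through with Selenium, the commented-out URL in the question suggests that the subpages are reachable directly as .../page/&lt;n&gt;. Below is a minimal sketch along those lines; the pages_to_get value and the politique category URL are illustrative assumptions, not something I verified against every category.

import requests
from newspaper.utils import BeautifulSoup

papers = []
urls_set = set()

# assumption: subpages follow the pattern .../category/france/politique/page/<n>
pages_to_get = 3
for page in range(1, pages_to_get + 1):
    url = 'http://www.legorafi.fr/category/france/politique/page/' + str(page)
    raw_html = requests.get(url)
    soup = BeautifulSoup(raw_html.text, 'html.parser')
    for articles_tags in soup.findAll('div', {'class': 'articles'}):
        for article_href in articles_tags.find_all('a', href=True):
            href = str(article_href['href'])
            # skip comment links and duplicates
            if not href.endswith('#commentaires') and href not in urls_set:
                urls_set.add(href)
                papers.append(href)

print(len(papers))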

Hopefully, this code helps you get closer to achieving your objective.
