
Newspaper python cache issue, every call same output

I use this module: https://github.com/codelucas/newspaper to download bitcoin articles from https://news.bitcoin.com/ . But when I try to get the next articles from the next page, 'https://news.bitcoin.com/page/2/', I get the same output. The same happens for any other page.

I have tried different sites and different starting pages. The articles from the first link I used were displayed for all other links as well.

import newspaper

url = 'https://news.bitcoin.com/page/2'
btc_articles = newspaper.build(url, memoize_articles=False)

for article in btc_articles.articles:
    print(article.url)

The newspaper library tries to scrape the whole website, not just the link you input. This means you shouldn't have to loop through all the pages to get the articles. However, as you may have noticed, the library doesn't find all the articles anyway.

The reason seems to be that it doesn't identify all pages as categories (and doesn't find any feed); see below (the output was the same regardless of the start page):

import newspaper

url = 'https://news.bitcoin.com/'
btc_paper = newspaper.build(url, memoize_articles=False)

print('Categories:', [category.url for category in btc_paper.categories])
print('Feeds:', [feed.url for feed in btc_paper.feeds])

Output:

Categories: ['https://news.bitcoin.com/page/2', 'https://news.bitcoin.com']
Feeds: []

This seems to be a bug in the library (or bad website design on bitcoin.com's part, depending on how you look at it), just as you noted in your issue report https://github.com/codelucas/newspaper/issues/670 .
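Until that is fixed, one possible workaround is to skip newspaper's category detection entirely: collect article links from each paginated listing page yourself and then feed the individual URLs to newspaper's Article class. The sketch below shows only the link-extraction step, run offline on a sample HTML fragment using the standard library; the sample markup is hypothetical, since the real structure of news.bitcoin.com's listing pages may differ.

```python
# Workaround sketch: extract article links from a listing page yourself,
# instead of relying on newspaper.build's category/feed detection.
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute href targets from every <a> tag in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's base URL.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return all absolute links found in the given HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical fragment standing in for a downloaded listing page.
sample = '<div><a href="/some-article/">Article</a><a href="/page/3/">Next</a></div>'
print(extract_links(sample, "https://news.bitcoin.com/"))
```

In practice you would download each page ('https://news.bitcoin.com/page/2/', '/page/3/', and so on), run extract_links on the HTML, filter the results down to article URLs, and then pass each one to newspaper's Article (article.download() followed by article.parse()), which works on individual article pages even when build misses them.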
