
Python Newspaper with web archive (wayback machine)

I'm trying to use the Python library newspaper with the Wayback Machine, which stores archived versions of websites. In theory, old news articles could be queried and downloaded from these archives.

For instance, the following code queries the CNBC archive for a specific archive date.

import newspaper

url = 'http://web.archive.org/web/20161201123529/http://www.cnbc.com/'
paper = newspaper.build(url, memoize_articles=False)

Although the archived page itself contains links to actual news articles from 2016-12-01, the newspaper module does not seem to pick them up. Instead, you get URLs such as:

https://blog.archive.org/2016/10/23/defining-web-pages-web-sites-and-web-captures/

which are not actual articles from this archived version of CNBC. However, newspaper works well with today's version of CNBC.

I suppose it gets confused by the format of the URL, which contains "http" twice. Does anyone have any suggestions on how to extract articles from the Wayback Machine archives?

This was an interesting problem, which I will add to my Newspaper Usage Overview document available on GitHub.

I attempted to use newspaper.build, but I couldn't get it to work correctly, so I used newspaper's Source class directly instead.

from time import sleep
from random import randint
from newspaper import Config
from newspaper import Source

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

wayback_cnbc = Source(url='https://web.archive.org/web/20180301012621/https://www.cnbc.com/', config=config,
                  memoize_articles=False, language='en', number_threads=20, thread_timeout_seconds=2)

wayback_cnbc.build()
for article_extract in wayback_cnbc.articles:
    article_extract.download()
    article_extract.parse()

    print(article_extract.publish_date)
    print(article_extract.title)
    print(article_extract.url)
    print('')

    # this sleep timer helps with some timeout issues
    # that were happening when querying the Wayback Machine
    sleep(randint(1, 3))

The example above outputs this:

None
Media
https://web.archive.org/web/20180301012621/https://www.cnbc.com/media/
    
None
CNBC Video
https://web.archive.org/web/20180301012621/https://www.cnbc.com/video/

2017-11-08 00:00:00
CNBC Healthy Returns
https://web.archive.org/web/20180301012621/https://www.cnbc.com/2017/11/08/healthy-returns.html

2018-02-28 00:00:00
Markets in Asia decline as dollar steadies; Nikkei falls 307 points 
https://web.archive.org/web/20180301012621/https://www.cnbc.com/2018/02/28/asia-markets-stocks-dollar-and-china-caixin-pmi-in-focus.html

2018-02-28 00:00:00
S&P 500 rises, but on track to snap longest monthly win streak since 1959
https://web.archive.org/web/20180301012621/https://www.cnbc.com/2018/02/28/us-stocks-interest-rates-fed-markets.html
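
Because every URL in the output above carries the Wayback prefix, it can be useful to split it back into the snapshot time and the original address, for example as a fallback when publish_date is None, or to discard links that are not captures of the site at all (such as the blog.archive.org link from the question). The following is only a minimal sketch using the standard library. The helper names split_wayback_url and is_capture_of are hypothetical, and it assumes capture URLs have the form https://web.archive.org/web/<14-digit timestamp>/<original URL>, optionally with a two-letter modifier such as id_ after the timestamp.

```python
import re
from datetime import datetime
from urllib.parse import urlparse

# Assumed capture URL layout:
#   https://web.archive.org/web/<14-digit timestamp>[modifier]/<original URL>
WAYBACK_RE = re.compile(
    r'^https?://web\.archive\.org/web/(\d{14})(?:[a-z]{2}_)?/(.+)$'
)


def split_wayback_url(url):
    """Return (snapshot_datetime, original_url), or None if url is not a capture URL."""
    match = WAYBACK_RE.match(url)
    if match is None:
        return None
    timestamp, original_url = match.groups()
    return datetime.strptime(timestamp, '%Y%m%d%H%M%S'), original_url


def is_capture_of(url, domain):
    """Return True if url is a Wayback capture whose original host is on domain."""
    parts = split_wayback_url(url)
    if parts is None:
        return False
    # hostname is lowercased and has no port; it is None when the scheme is missing
    host = urlparse(parts[1]).hostname or ''
    return host == domain or host.endswith('.' + domain)


if __name__ == '__main__':
    sample = 'https://web.archive.org/web/20180301012621/https://www.cnbc.com/media/'
    print(split_wayback_url(sample))
    print(is_capture_of(sample, 'cnbc.com'))
```

With helpers like these, the loop above could skip any article whose URL fails is_capture_of(article_extract.url, 'cnbc.com') before calling download(), which avoids spending requests on non-CNBC links.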
     

Hopefully, this answer helps with your use case for querying the Wayback Machine for articles. If you have any questions, please let me know.
