How can I get the original URL from an archive.is short link using python?

I would like to write a function that takes an archive.is (or archive.fo, archive.li, or archive.today) link as input and returns the URL of the original site.

For example, if the input is 'http://archive.is/9mIro', the output should be 'http://www.dailytelegraph.com.au/news/nsw/australian-army-bans-male-recruits-to-get-female-numbers-up/news-story/69ee9dc1d4f8836e9cca7ca2e3e5680a'.

How can I do this in Python?

Yes, your approach could work for another site, but archive.is seems to protect its data from automated queries: when I try curl or Python (urllib2), I get the error Empty reply from server. You need something like PhantomJS that mimics a real browser. Even then, I believe it will only work for a few queries before the site shows a CAPTCHA or returns errors. They also seem to log IP addresses: even PhantomJS gets errors when run from the same machine where curl or Python was tried.

Here's PhantomJS code that works:

var webPage = require('webpage');
var page = webPage.create();
var system = require('system');
var args = system.args;

page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';

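// Open the archive page and pass the original URL to cb (null on failure).
function getOriginalUrl(shortUrl, cb) {
  page.open(shortUrl, function(status) {
    if (status !== 'success') {
      cb(null);
      return;
    }
    // The first input inside the page's form holds the original URL.
    var url = page.evaluate(function() {
      var input = document.querySelector('form input');
      return input ? input.value : null;
    });
    cb(url);
  });
}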

if (args.length > 1) {
  getOriginalUrl(args[1],function(url){
    console.log(url);
    phantom.exit();
  });
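} else {
  // args[0] is the script name; the archive link is expected as args[1].
  console.log('Usage: phantomjs ' + args[0] + ' <archive-url>');
  phantom.exit(1);
}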
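
Since the question asks for Python, the extraction step of the script above could be sketched in Python roughly as follows. This is only a sketch under assumptions: it assumes the page HTML has already been obtained through some browser-like tool (fetching is not shown, because plain HTTP clients were rejected as described above), and that the original URL still sits in the first input inside a form, as the selector 'form input' implies. The names original_url_from_html and _FirstFormInput are hypothetical and only used for illustration.

```python
from html.parser import HTMLParser


class _FirstFormInput(HTMLParser):
    """Record the value of the first <input> that appears inside a <form>."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0   # how many <form> tags are currently open
        self.found = False    # True once the first matching <input> is seen
        self.value = None     # its "value" attribute, if any

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        elif tag == "input" and self.form_depth > 0 and not self.found:
            self.found = True
            self.value = dict(attrs).get("value")

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth > 0:
            self.form_depth -= 1


def original_url_from_html(html):
    """Return the value of the first input inside a form, or None.

    Assumption: `html` is the source of an archive page that was already
    downloaded by other means; this function performs no network access.
    """
    parser = _FirstFormInput()
    parser.feed(html)
    return parser.value
```

Calling original_url_from_html on the HTML text of a saved archive page should give the same string the PhantomJS script prints, provided the page layout matches the assumption above. It returns None when no form input is present, which is also what to expect if the site served a CAPTCHA or error page instead.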
