Python: How to replace all relative urls in a document with absolute urls

Question

I am writing an app for Google App Engine that fetches the content of a url and then writes the content of that external url to the local page. I am able to do this, but the obvious issue is that the relative urls point to non-existent pages. I'm not very experienced with python so writing code like this on my own would probably take years.

Here's my code so far:

url = "http://www.google.com/"
try:
  result = urllib2.urlopen(url)
  self.response.out.write(result.read())
except urllib2.URLError, e:
  self.response.out.write(e)

Note: I'm not creating a malicious app.

Answer 1

The URLs would be relative to the base URL of the page you are looking at. So you need to get that base passed into your backend python code. You could use document.URL if you are calling your python from Javascript.

Or, possibly, self.request.referer will be useful to you.

The answer depends on where the relative URLs are coming from and how you are calling your python, it's not clear from your question.

Answer 2

I can tell you broadly what you'll need to do, but unfortunately, it's a little complicated and you're probably not going to like it. Python defines a very generic template class called html.parser for doing exactly this sort of thing. The class defines a feed() method which provides the main point of access for an end user such as yourself. The feed() method rips through the raw html, and as it encounters different html markup items, it calls different "handler" methods for processing each one. You actually use the class by overriding these "handler" methods, most of which are empty (ie, they simply return without doing anything) by default. The link that I included above provides some example code demonstrating how to implement this override for trivial cases.

For most of the handler methods, you will override the empty default logic by simply telling the handler to print whatever item it encounters, perhaps with an additional "<" or "\\" or ">" character printed at the beginning or end as appropriate (the parser strips these out by default). In this way, you will cause the parser to simply write out the same html code again just exactly as it encountered it. But for one of the handler methods, specifically the handle_starttag() method, you will have to provide some additional logic so that when you encounter an "A" tag with an attribute keyed by "HREF", you inspect the value associated with the "HREF" key, and then substitute a full URL address rather than a relative address if required.

Python: How to replace all relative urls in a document with absolute urls

Question

2 answers

solution1
0 2013-12-28 08:39:46

solution2
0 ACCPTED 2013-12-28 20:11:12

Python: How to replace all relative urls in a document with absolute urls

Question

2 answers

solution1 0 2013-12-28 08:39:46

solution2 0 ACCPTED 2013-12-28 20:11:12

solution1
0 2013-12-28 08:39:46

solution2
0 ACCPTED 2013-12-28 20:11:12