Python: How to replace all relative URLs in a document with absolute URLs

I am writing an app for Google App Engine that fetches the content of a URL and then writes the content of that external URL to the local page. I am able to do this, but the obvious issue is that the relative URLs point to non-existent pages. I'm not very experienced with Python, so writing code like this on my own would probably take years.

Here's my code so far:

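import urllib2

url = "http://www.google.com/"
try:
    # Fetch the external page and echo its body into the response.
    result = urllib2.urlopen(url)
    self.response.out.write(result.read())
except urllib2.URLError as e:
    # Write the error message as text rather than the exception object.
    self.response.out.write(str(e))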

Note: I'm not creating a malicious app.

The URLs would be relative to the base URL of the page you are looking at, so you need to pass that base into your backend Python code. If you are calling your Python code from JavaScript, you could use document.URL.

Or, possibly, self.request.referer will be useful to you.

The answer depends on where the relative URLs are coming from and how you are calling your Python code, which is not clear from your question.
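However the base URL is obtained, resolving a relative link against it can be sketched with the standard library's urljoin. This is a minimal sketch assuming Python 3, where the function lives in urllib.parse (in Python 2 it is in the urlparse module). The base URL, the sample links, and the helper name make_absolute are illustrative assumptions, not part of the question.

```python
from urllib.parse import urljoin

# Assumed base: the URL of the page that was fetched.
base_url = "http://www.google.com/intl/en/about.html"


def make_absolute(base, link):
    """Resolve link against base; already-absolute links come back unchanged."""
    return urljoin(base, link)


# Sample links chosen for illustration only.
for link in ["/images/logo.png", "products.html", "../policies/",
             "http://example.com/x"]:
    print(link, "->", make_absolute(base_url, link))
```

In the question's setup the base would most likely be the URL passed to urlopen, since that is the page the relative links came from.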

I can tell you broadly what you'll need to do, but unfortunately, it's a little complicated and you're probably not going to like it. Python's standard library provides a very generic base class, HTMLParser in the html.parser module (the HTMLParser module in Python 2), for doing exactly this sort of thing. The class defines a feed() method, which is the main point of access for an end user such as yourself. The feed() method rips through the raw HTML, and as it encounters different markup items, it calls a different "handler" method to process each one. You use the class by subclassing it and overriding these handler methods, most of which are empty by default (i.e., they simply return without doing anything). The html.parser documentation provides some example code demonstrating how to implement this override for trivial cases.
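To make the feed()/handler relationship concrete, here is a minimal sketch (Python 3) of a subclass that only records which handlers are called and with what arguments. The class name EventLogger and the sample HTML are hypothetical; the handler names are the ones HTMLParser defines.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical subclass that records each handler call made by feed()."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Text between tags.
        self.events.append(("data", data))


parser = EventLogger()
parser.feed('<p>See <a href="about.html">About</a></p>')
parser.close()  # flush anything still buffered
for event in parser.events:
    print(event)
```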

For most of the handler methods, you will override the empty default logic by simply telling the handler to print whatever item it encounters, adding back the "<", "/", or ">" characters at the beginning or end as appropriate (the parser strips these out before calling the handler). In this way, you will cause the parser to write out the same HTML again, essentially as it encountered it. But for one of the handler methods, specifically handle_starttag(), you will have to provide some additional logic: when you encounter an "a" tag with an "href" attribute (the parser converts tag and attribute names to lower case), inspect the value associated with "href" and, if it is a relative address, substitute the full URL.
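The approach described above might be sketched as follows in Python 3. This is an illustration under stated assumptions, not a definitive implementation: the names AbsoluteLinkRewriter and absolutize are hypothetical, output is collected in a list instead of printed, and only the href of "a" tags is rewritten, as in the description. The parser hands over attribute values already unescaped, so the sketch re-escapes them when writing the tag back out.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin


class AbsoluteLinkRewriter(HTMLParser):
    """Re-emits the HTML it is fed, making <a href> values absolute."""

    def __init__(self, base_url):
        # Keep entity and character references as separate events so they
        # can be written back out unchanged.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []

    def _format_tag(self, tag, attrs, closing=">"):
        parts = [tag]
        for name, value in attrs:
            if tag == "a" and name == "href" and value is not None:
                # urljoin leaves already-absolute URLs untouched.
                value = urljoin(self.base_url, value)
            if value is None:
                parts.append(name)  # bare attribute such as "disabled"
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        return "<" + " ".join(parts) + closing

    def handle_starttag(self, tag, attrs):
        self.out.append(self._format_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._format_tag(tag, attrs, " />"))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def absolutize(html_text, base_url):
    """Return html_text with relative <a href> links resolved against base_url."""
    parser = AbsoluteLinkRewriter(base_url)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


page = ('<p>Go <a href="docs/intro.html">here</a> &amp; '
        '<a href="http://example.com/">there</a></p>')
print(absolutize(page, "http://www.google.com/a/index.html"))
```

The output is not guaranteed to be byte-for-byte identical to the input (attribute quoting and spacing are normalized, and processing instructions are not handled here). Other URL-bearing attributes such as src on img or href on link would need the same treatment if the page's images and stylesheets should also load.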
