Using htmlparse to replace image and CSS source URLs in an HTML file (Python)

I'm trying to write a script that will download a webpage, including all the images and style sheets, i.e. so that a locally hosted version looks identical to the original.

Originally I was just downloading the images, but I realize now that I have to (of course) edit the HTML source so that each img src actually points to the locally hosted image. As I have to change the HTML source anyway, I decided it would be better to just update the locally hosted file so that it points to the images and style sheets hosted remotely.
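To make the "point at the remote copy" idea concrete: it amounts to turning every relative link into an absolute one, using the page's own address as the base. The standard library can do that resolution on its own. This is only a minimal sketch assuming Python 3 (`urllib.parse`; on Python 2 the same function lives in `urlparse`). The helper name `absolutize` and the example.com addresses are made up for illustration.

```python
from urllib.parse import urljoin


def absolutize(page_url, link):
    """Resolve a link found in a page against that page's URL.

    Relative links become absolute; links that are already absolute
    are returned unchanged.
    """
    return urljoin(page_url, link)


# Hypothetical page address, used only for illustration.
page_url = "http://example.com/docs/page.html"

print(absolutize(page_url, "images/logo.png"))
print(absolutize(page_url, "/css/site.css"))
print(absolutize(page_url, "http://cdn.example.com/a.png"))
```

A path-relative link is resolved against the page's directory, a root-relative link against the host, and an absolute link is left alone.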

So this brings me to my question: can I use htmlparse to search for the style sheet and image tags and then replace the links in them with the updated versions?

I've had a look at the htmlparse documentation, but I'm still pretty new to Python, so some parts are unclear. I thought it might be possible to use:

HTMLParser.handle_data(data)
This method is called to process arbitrary data. It is intended to be overridden by a 
derived class; the base class implementation does nothing.

and add my own replacing class to it? Or am I on totally the wrong lines?
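For reference, `handle_data` only receives the text between tags. Tag attributes such as `src` and `href` are passed to `handle_starttag` (and `handle_startendtag` for self-closing tags), so that is the hook a replacing subclass would override. Below is a rough sketch of that approach, assuming Python 3, where the module is `html.parser`. The names `UrlRewriter`, `URL_ATTRS` and `rewrite` are invented for this example. It re-emits the document as it is parsed, so attribute quoting and tag case are normalised rather than preserved exactly, and it has not been tried on badly malformed pages.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

# URL-carrying attribute per tag (only the two cases from the question).
URL_ATTRS = {"img": "src", "link": "href"}


class UrlRewriter(HTMLParser):
    """Re-emit the HTML that is fed in, making img/link URLs absolute."""

    def __init__(self, base_url):
        # convert_charrefs=False keeps entities like &amp; as separate events
        # so they can be written back out unchanged.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []

    def _tag(self, tag, attrs, closing=">"):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # bare attribute such as "disabled"
                parts.append(name)
                continue
            if URL_ATTRS.get(tag) == name:
                value = urljoin(self.base_url, value)
            # The parser hands over unescaped values, so escape them again.
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        return "<" + " ".join(parts) + closing

    def handle_starttag(self, tag, attrs):
        self.out.append(self._tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._tag(tag, attrs, " />"))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def rewrite(html_text, base_url):
    """Return html_text with img src and link href made absolute."""
    parser = UrlRewriter(base_url)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Calling `rewrite(page_source, page_url)` on the downloaded source should give a copy whose images and style sheets load from the original site. To point at local copies instead, the `urljoin` call in `_tag` could be swapped for a lookup from remote URL to local file name.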

Another option, of course, would be to use regular expressions to search for the tags and replace the text after them, but this could get pretty complicated, so I was wondering if htmlparse would provide a simpler solution.

I realize that Beautiful Soup would be the ideal solution, but I will be distributing the finished tool around my company, so unfortunately I can't use any third-party modules. Similarly, I'd like the tool to be platform independent, so unfortunately I cannot use wget.

Thanks for any input =)

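If you package the Python program as a self-contained binary (one that does not even need a Python runtime installed), you can use whatever modules you like: http://www.pyinstaller.org/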
