
Searching for a specific HTML string using Python

Which modules would be best for writing a Python program that searches through hundreds of HTML documents and deletes a given string of HTML? For instance, suppose some of the pages contain <a href="test.html">Test</a> and I want to delete it from every HTML page that has it.

Any help is much appreciated. I don't need someone to write the program for me, just a helpful pointer in the right direction.

If the string you are searching for appears in the HTML literally, then simple string replacement will be fine:

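# html_file is the path to one document; my_string is the exact text to delete
with open(html_file) as f:
    old_html = f.read()
new_html = old_html.replace(my_string, "")
if new_html != old_html:  # rewrite only the files that actually changed
    with open(html_file, "w") as f:
        f.write(new_html)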

As an example of the string not being in the HTML literally, suppose you are looking for <a href="test.html">Test</a>, as you said. Do you want it to match these snippets of HTML as well?

<a href='test.html'>Test</a>
<A HREF='test.html'>Test</A>
<a href="test.html" class="external">Test</a>
<a href="test.html">Tes&#116;</a>

and so on: the "same" HTML can be expressed in many different ways. If you know the precise characters used in the HTML, then simple string replacement is fine. If you need to match at an HTML semantic level, then you'll need more advanced tools such as BeautifulSoup. In that case the HTML output may differ considerably from what you started with, even in the sections not affected by the deletion, because the entire file will have been parsed and reconstituted.
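To make the difference between literal and semantic matching concrete, here is a minimal sketch that uses the standard library's html.parser (Python 3) instead of BeautifulSoup. The names LinkFinder and has_link are made up for this illustration. It only detects a matching link; it does not delete it or rewrite the file.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collect (href, text) pairs for every <a> element (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns references such as &#116; into characters
        super().__init__(convert_charrefs=True)
        self.links = []      # finished (href, text) pairs
        self._in_a = False   # True while inside an open <a> element
        self._href = None    # href of the <a> currently open
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, values arrive unquoted
        if tag == "a":
            self._in_a = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            self.links.append((self._href, "".join(self._text)))
            self._in_a = False


def has_link(html, href, text):
    """Return True if html contains an <a> with this href and this text."""
    finder = LinkFinder()
    finder.feed(html)
    finder.close()
    return (href, text) in finder.links
```

Called as has_link(page, "test.html", "Test"), this treats all four snippets above as the same link, while a plain str.replace would match none of them.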

To run the code over many files, you'll find os.walk useful for finding files in a directory tree (os.path.walk did the same job in Python 2 but was removed in Python 3), or glob.glob for matching filenames against shell-like wildcard patterns.
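As a rough sketch of how the literal replacement could be applied to a whole directory tree with os.walk on Python 3: the function name strip_string_from_tree, the ".html" suffix filter, and the UTF-8 encoding are assumptions for this example, so adjust them to your files.

```python
import os


def strip_string_from_tree(root, target, suffix=".html"):
    """Delete every literal occurrence of target from matching files under root.

    Returns the list of paths that were rewritten. Assumes UTF-8 files.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(suffix):
                continue  # skip anything that is not an HTML file
            path = os.path.join(dirpath, name)
            # newline="" keeps the original line endings untouched
            with open(path, encoding="utf-8", newline="") as f:
                old_html = f.read()
            new_html = old_html.replace(target, "")
            if new_html != old_html:  # rewrite only files that changed
                with open(path, "w", encoding="utf-8", newline="") as f:
                    f.write(new_html)
                changed.append(path)
    return changed
```

For example, strip_string_from_tree("site", '<a href="test.html">Test</a>') would process every .html file below a directory named site. If a wildcard pattern is more convenient than walking the tree, glob.glob("site/**/*.html", recursive=True) yields a similar list of paths. Try it on a copy of the files first, since the files are overwritten in place.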

htmllib

This module defines a class which can serve as a base for parsing text files formatted in the HyperText Mark-up Language (HTML). The class is not directly concerned with I/O: it must be provided with input in string form via a method, and it makes calls to methods of a "formatter" object in order to produce output. The HTMLParser class is designed to be used as a base class for other classes in order to add functionality, and allows most of its methods to be extended or overridden. In turn, this class is derived from and extends the SGMLParser class defined in module sgmllib. The HTMLParser implementation supports the HTML 2.0 language as described in RFC 1866.

Note that htmllib and sgmllib exist only in Python 2; in Python 3 the closest standard-library module is html.parser.
