
Searching for specific HTML string using Python

Which modules would be best for writing a Python program that searches through hundreds of HTML documents and deletes a given string of HTML? For instance, suppose an HTML document contains <a href="test.html">Test</a> and I want to delete it from every HTML page that has it.

Any help is much appreciated, and I don't need someone to write the program for me, just a helpful point in the right direction.

If the string you are searching for will be in the HTML literally, then simple string replacement will be fine:

with open(html_file) as f:
    old_html = f.read()
new_html = old_html.replace(my_string, "")
# Rewrite the file only if something was actually removed.
if new_html != old_html:
    with open(html_file, "w") as f:
        f.write(new_html)

As an example of the string not being in the HTML literally, suppose you are looking for <a href="test.html">Test</a> as you said. Would you also want it to match these snippets of HTML?

<a href='test.html'>Test</a>
<A HREF='test.html'>Test</A>
<a href="test.html" class="external">Test</a>
<a href="test.html">Tes&#116;</a>

and so on: the "same" HTML can be expressed in many different ways. If you know the precise characters used in the HTML, then simple string replacement is fine. If you need to match at an HTML semantic level, then you'll need more advanced tools like BeautifulSoup. Be aware that the output may then differ substantially from the HTML you started with, even in sections not affected by the deletion, because the entire file will have been parsed and reconstituted.
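To make the idea of matching at a semantic level concrete, here is a minimal sketch that only detects the link; it does not delete it. It uses the standard library's html.parser rather than BeautifulSoup, and the names LinkFinder and contains_link are made up for illustration.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Detects <a> elements with a given href and link text (hypothetical helper)."""

    def __init__(self, href, text):
        super().__init__()  # convert_charrefs=True (default) decodes &#116; etc.
        self.href = href
        self.text = text
        self.found = False
        self._buffer = None  # collects text while inside a candidate <a>

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values arrive without quotes.
        if tag == "a" and dict(attrs).get("href") == self.href:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            if "".join(self._buffer).strip() == self.text:
                self.found = True
            self._buffer = None


def contains_link(html, href="test.html", text="Test"):
    """Return True if `html` contains an <a> with this href and link text."""
    finder = LinkFinder(href, text)
    finder.feed(html)
    finder.close()
    return finder.found
```

Because the parser normalizes tag case, attribute quoting, and character references, a call such as contains_link(page_source) treats all four variants above as the same link, which a literal str.replace would not.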

To execute code over many files, you'll find os.walk useful for finding files in a tree (the older os.path.walk was removed in Python 3), or glob.glob for matching filenames to shell-like wildcard patterns.
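As a rough sketch of how the literal replacement could be applied across a directory tree with os.walk, the tree walker available in Python 3: the function name strip_string_from_tree, the extension filter, and the UTF-8 encoding are all assumptions for illustration, so adjust them to your files.

```python
import os


def strip_string_from_tree(root, target, extensions=(".html", ".htm")):
    """Remove every literal occurrence of `target` from HTML files under `root`.

    Returns the list of paths that were rewritten. Assumes UTF-8 files.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            # newline="" keeps the file's original line endings intact.
            with open(path, encoding="utf-8", newline="") as f:
                old_html = f.read()
            new_html = old_html.replace(target, "")
            if new_html != old_html:
                with open(path, "w", encoding="utf-8", newline="") as f:
                    f.write(new_html)
                changed.append(path)
    return changed
```

It would be called as strip_string_from_tree("site", '<a href="test.html">Test</a>'), where "site" is a placeholder directory. Since the files are rewritten in place, it is worth trying it on a copy of the tree first.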

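htmllib (Python 2 only; it was removed in Python 3, where html.parser is the standard-library HTML parser):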

This module defines a class which can serve as a base for parsing text files formatted in the HyperText Mark-up Language (HTML). The class is not directly concerned with I/O — it must be provided with input in string form via a method, and makes calls to methods of a “formatter” object in order to produce output. The HTMLParser class is designed to be used as a base class for other classes in order to add functionality, and allows most of its methods to be extended or overridden. In turn, this class is derived from and extends the SGMLParser class defined in module sgmllib. The HTMLParser implementation supports the HTML 2.0 language as described in RFC 1866.
