简体   繁体   中英

Simple .html filter in python - modify text elements only

I need to filter a rather long (but very regular) set of .html files to modify a few constructs only if they appear in text elements.

One good example is to change <p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p> <p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p> to <p><div class="speech">it's hard to find his &ldquo;good&rdquo; side! He has <i>none</i>!<div></p> <p><div class="speech">it's hard to find his &ldquo;good&rdquo; side! He has <i>none</i>!<div></p> <p><div class="speech">it's hard to find his &ldquo;good&rdquo; side! He has <i>none</i>!<div></p> .

I can easily parse my files with html.parser , but it's unclear how to generate result file, which should be as similar to input as possible (no reformatting).

I had a look to beautiful-soup, but it really seems too big for this (supposedly?) simple task.

Note: I do not need/want to serve .html files to a browser of any kind; I just need them updated (possibli in-place) with (slightly) changed content.

UPDATE:

Following @soundstripe advice Iwrote the following code:

import bs4
from re import sub

def handle_html(html):
    sp = bs4.BeautifulSoup(html, features='html.parser')
    for e in list(sp.strings):
        s = sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', e)
        if s != e:
            e.replace_with(s)
    return str(sp).encode()

raw = b"""<p><div class="speech">it's hard to "find" his "good" side! He has <i>none</i>!<div></p>"""
new = handle_html(raw)
print(raw)
print(new)

Unfortunately BeautifulSoup tries to be too smart from its (and my) own good:

b'<p><div class="speech">it\'s hard to "find" his "good" side! He has <i>none</i>!<div></p>'
b'<p><div class="speech">it\'s hard to &amp;ldquo;find&amp;rdquo; his &amp;ldquo;good&amp;rdquo; side! He has <i>none</i>!<div></div></div></p>'

ie: it transforms plain & to &amp; thus breaking &ldquo; entity (notice I'm working with bytearrays, not strings. Is it relevant?).

How can I fix this?

I don't know why you wouldn't use BeautifulSoup. Here's an example that replaces your quotes like you're asking.

import re
import bs4

raw = b"""<p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p> to <p><div class="speech">it's hard to find his &ldquo;good&rdquo; side! He has <i>none</i>!<div></p>"""
soup = bs4.BeautifulSoup(raw, features='html.parser')

def replace_quotes(s):
    return re.sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', e)


for e in list(soup.strings):
    # wrapping the new string in BeautifulSoup() call to correctly parse entities
    new_string = bs4.BeautifulSoup(replace_quotes(e))
    e.replace_with(new_string)

# use the soup.encode() formatter keyword to specify you want html entities in your output
new = soup.encode(formatter='html')


print(raw)
print(new)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM