
How do you replace specific characters in BeautifulSoup?

I know how to replace the text inside a tag using bs4, but how would I change a specific character in, say, a p tag into another character or string enclosed in a b tag?

An example would be if I wanted to bold/highlight all the j's in a paragraph.

If you want to insert tags into text, you'll have to break the text into three pieces: everything before the match, the text going into the new tag, and everything after.

This has to be done for every match in the text, so you also need to keep track of the trailing piece after each insertion, because the next search continues from there:

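def inject_tag(text, start, end, tagname, **attrs):
    """Wrap text[start:end] in a new tag and return the text node after it."""
    # find the document root (the BeautifulSoup object), which creates new nodes
    root = text
    while root.parent is not None:
        root = root.parent

    # split the text node into three pieces: before, match, after
    before = root.new_string(text[:start])
    new_tag = root.new_tag(tagname, **attrs)
    new_tag.string = text[start:end]
    after = root.new_string(text[end:])

    # swap the original node for the three pieces, in document order
    text.replace_with(before)
    before.insert_after(new_tag)
    new_tag.insert_after(after)
    # return the trailing piece so the caller can keep searching it
    return after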

Then use the above function to replace specific indices:

>>> import re
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <p>The quick brown fox jumps over the lazy dog</p>
... ''')
>>> the = re.compile(r'the', flags=re.I)
>>> text = soup.p.string
>>> while True:
...     match = the.search(str(text))
...     if not match: break
...     start, stop = match.span()
...     text = inject_tag(text, start, stop, 'b')
... 
>>> print(soup.prettify())
<html>
 <head>
 </head>
 <body>
  <p>
   <b>
    The
   </b>
   quick brown fox jumps over
   <b>
    the
   </b>
   lazy dog
  </p>
 </body>
</html>
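
For the case in the question (bolding every j), the same split-and-insert idea can be applied to each text node under a tag, which also covers paragraphs that already contain child tags. The following is only a minimal sketch, assuming Python 3 and the built-in html.parser; wrap_matches is a hypothetical helper name, not part of bs4, and the sample markup is made up for illustration.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def wrap_matches(soup, container, pattern, tagname="b"):
    """Wrap every regex match in container's text nodes in a new tag.

    Hypothetical helper for illustration; not part of bs4.
    """
    # Snapshot the text nodes first, because the loop modifies the tree.
    for node in list(container.find_all(string=True)):
        text = str(node)
        pieces = []
        pos = 0
        for match in pattern.finditer(text):
            if match.start() > pos:
                # plain text between the previous match and this one
                pieces.append(NavigableString(text[pos:match.start()]))
            tag = soup.new_tag(tagname)
            tag.string = match.group(0)
            pieces.append(tag)
            pos = match.end()
        if not pieces:
            continue  # no match in this text node
        if pos < len(text):
            # trailing text after the last match
            pieces.append(NavigableString(text[pos:]))
        # Replace the original node with the first piece, then chain the rest.
        node.replace_with(pieces[0])
        previous = pieces[0]
        for piece in pieces[1:]:
            previous.insert_after(piece)
            previous = piece


soup = BeautifulSoup("<p>The fox <i>jumps</i> and just enjoys it</p>", "html.parser")
wrap_matches(soup, soup.p, re.compile("j"))
result = str(soup.p)
print(result)
# <p>The fox <i><b>j</b>umps</i> and <b>j</b>ust en<b>j</b>oys it</p>
```

Building all the pieces for one text node before touching the tree avoids re-running the search on a shrinking tail, at the cost of assuming the pattern never matches an empty string.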

You can use the find_all() function to get all <p> elements and a regular expression to inject <b> elements around the letter you want, like this:

from bs4 import BeautifulSoup
import sys 
import re

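# parse the file named on the command line
with open(sys.argv[1]) as infile:
    soup = BeautifulSoup(infile)
for p in soup.find_all('p'):
    # p.string is None when the paragraph contains child tags, so skip those
    if p.string is not None:
        p.string = re.sub(r'(r)', r'<b>\1</b>', p.string)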

print(soup.prettify(formatter=None))

Note that I use formatter=None so that prettify() does not escape the inserted < and > as HTML entities (&lt; and &gt;).
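
The reason the formatter matters is that assigning to .string stores the substituted markup as a single plain-text node, so the tree itself contains no real <b> tags. If later code needs to find or modify those tags, one option is to re-parse the substituted markup and move the resulting nodes into the paragraph. This is a rough sketch rather than part of the answer above, assuming Python 3, the built-in html.parser, and a made-up one-word sample paragraph:

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>bar</p>", "html.parser")
p = soup.p
markup = re.sub(r"(r)", r"<b>\1</b>", p.string)

# Assigning markup to .string stores it as one plain-text node.
p.string = markup
escaped = str(p)             # the default formatter escapes < and >
found_before = p.find("b")   # None: there is no <b> tag in the tree yet

# Re-parse the markup and move the resulting nodes into the paragraph.
fragment = BeautifulSoup(markup, "html.parser")
p.clear()
for child in list(fragment.contents):
    p.append(child)          # append() detaches child from fragment first
reparsed = str(p)
found_after = p.find("b")

print(escaped)   # <p>ba&lt;b&gt;r&lt;/b&gt;</p>
print(reparsed)  # <p>ba<b>r</b></p>
```

Re-parsing like this is only reasonable when the paragraph text contains no literal < or & characters, since those would be interpreted as markup the second time around.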

Using this test text:

<div>
    <div class="post-text" itemprop="text">

        <p>I'm well aware on how to replace texts in tags using bs4 but how would I actually change a specific character in, say a p-tag, into another character or string enclosed in a b-tag?</p>

<p>An example would be if I wanted to bold/highlight all the j's in a paragraph.</p>

    </div>
</div>

Run it like:

python script.py infile 

That yields:

<html>
 <body>
  <div>
   <div class="post-text" itemprop="text">
    <p>
     I'm well awa<b>r</b>e on how to <b>r</b>eplace texts in tags using bs4 but how would I actually change a specific cha<b>r</b>acte<b>r</b> in, say a p-tag, into anothe<b>r</b> cha<b>r</b>acte<b>r</b> o<b>r</b> st<b>r</b>ing enclosed in a b-tag?
    </p>
    <p>
     An example would be if I wanted to bold/highlight all the j's in a pa<b>r</b>ag<b>r</b>aph.
    </p>
   </div>
  </div>
 </body>
</html>
