How to replace nested blockquote tags with single tag with class?

Question

I inherited several thousand messy HTML files that use repeated blockquote tags to display lines of poetry.

Example:

 <blockquote><blockquote>roses are red</blockquote></blockquote><br> <blockquote><blockquote><blockquote>violets are blue</blockquote></blockquote></blockquote><br> <blockquote><blockquote>this is another line</blockquote></blockquote><br> <blockquote><blockquote><blockquote>and this is too</blockquote></blockquote></blockquote><br>

For lines of free verse, you'll see as many as 7-8 blockquote tags wrapping a line of text. I want to replace each set of nested blockquote tags with a single tag and give it a class such as “indent-7” or “indent-8.”

There is unpredictable white space between the blockquote tags. Some have spaces between them, some are separated by new lines. I'm thinking Python's BeautifulSoup is the way to handle this task.

How can I replace the nested blockquote tags with a single tag with a class of “n” where n is the number of tags that were nested?

Answer 1

This is how I would approach it with lxml:

(Note that I added a line to the poem, to test for tags separated by space.)

import lxml.html

poem = """
<doc>
  <blockquote><blockquote>roses are red</blockquote></blockquote><br/>
  <blockquote>     <blockquote>roses are green</blockquote></blockquote><br/>
       <blockquote>
         <blockquote><blockquote>violets are blue</blockquote></blockquote></blockquote><br/>
    <blockquote><blockquote>this is another line</blockquote></blockquote><br/>
    <blockquote><blockquote><blockquote>and this is too</blockquote></blockquote></blockquote><br/>

</doc>
"""

doc = lxml.html.fromstring(poem)
# Select every text node that contains something other than whitespace
targ = doc.xpath('//text()[normalize-space(.)]')
for t in targ:
    # Count the blockquote elements wrapping this text, including its direct parent
    count = int(t.getparent().xpath("count(ancestor-or-self::blockquote)"))
    print(f'<blockquote indent="{count}">{t}</blockquote>')

Output:

<blockquote indent="2">roses are red</blockquote>
<blockquote indent="2">roses are green</blockquote>
<blockquote indent="3">violets are blue</blockquote>
<blockquote indent="2">this is another line</blockquote>
<blockquote indent="3">and this is too</blockquote>
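If lxml is not available, the same depth counting can be sketched with the standard library's html.parser instead. This is a different technique from the XPath query above, not a drop-in equivalent: it tracks how many blockquote tags are open when each piece of text arrives. The class name BlockquoteFlattener, the function flatten, and the "indent-N" class format are illustrative assumptions, and the sketch assumes each line of the poem is plain text with no inner markup.

```python
from html import escape
from html.parser import HTMLParser


class BlockquoteFlattener(HTMLParser):
    """Collect (depth, text) pairs for text nested inside blockquote tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of blockquote tags currently open
        self.lines = []  # (depth, text) for each non-blank text node

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Ignore whitespace-only text and text outside any blockquote
        if text and self.depth > 0:
            self.lines.append((self.depth, text))


def flatten(html):
    """Return one single-tag blockquote string per line of text (hypothetical helper)."""
    parser = BlockquoteFlattener()
    parser.feed(html)
    parser.close()
    # The parser decodes entities, so re-escape the text before writing it out
    return [
        f'<blockquote class="indent-{depth}">{escape(text, quote=False)}</blockquote>'
        for depth, text in parser.lines
    ]


sample = (
    "<blockquote><blockquote>roses are red</blockquote></blockquote><br>\n"
    "<blockquote>\n  <blockquote><blockquote>violets are blue"
    "</blockquote></blockquote></blockquote><br>"
)
result = flatten(sample)
for line in result:
    print(line)
```

Because the parser only counts open tags, whitespace and newlines between the blockquote tags do not affect the result.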

Just for good measure (and for the benefit of future readers), this is how I would do it with XQuery:

let $j := <doc>
...text of poem above... 
</doc>

for $targ in $j//text()[normalize-space(.)] 

let $line := $targ/data(.) 
let $count := count($targ/ancestor::blockquote)
return 
<blockquote nested="{$count}">{$line}</blockquote>

Same output, except that the attribute is named nested rather than indent.

Answer 2

You could remove the whitespace and newlines manually. Once they are gone, the rest of the job gets easier.

If doing that by hand is not an option, you can use PHP to strip the whitespace between tags:

$html = preg_replace('~>\\s+<~m', '><', $html);
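For readers working in Python, as the question suggests, a minimal sketch of the same whitespace-stripping step with the standard re module might look like this. The function name strip_inter_tag_whitespace is an illustrative assumption.

```python
import re


def strip_inter_tag_whitespace(html):
    """Remove runs of whitespace (including newlines) that sit between '>' and '<'."""
    return re.sub(r">\s+<", "><", html)


messy = "<blockquote> \n <blockquote>roses are red</blockquote>\n</blockquote>"
cleaned = strip_inter_tag_whitespace(messy)
print(cleaned)
```

Note that this removes whitespace between any two adjacent tags in the file, not only blockquote tags, which can change how inline elements render.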

Now, to replace the blockquotes you can even use the Notepad++ find/replace functionality; you just need to find a pattern. For instance, in the code you posted there are two or three nested blockquotes at most, so in Notepad++ you need four "Replace All" operations. Run the three-tag searches first; otherwise the two-tag searches would also match inside the three-tag runs.

search for <blockquote><blockquote><blockquote> and replace with a single opening tag carrying the class for three levels (a span, if you prefer)
search for </blockquote></blockquote></blockquote> and replace with the matching closing tag
search for <blockquote><blockquote> and replace with a single opening tag carrying the class for two levels (a span, if you prefer)
search for </blockquote></blockquote> and replace with the matching closing tag
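The find/replace steps above need one pair of operations per nesting depth. A single regular expression can handle any depth, as in the rough sketch below. It assumes the blockquote tags carry no attributes and that each line of text contains no inner markup. The names NESTED, collapse, and collapse_blockquotes and the "indent-N" class format are illustrative.

```python
import re

# A run of opening tags, the line text, then a run of closing tags.
# \s* tolerates the spaces and newlines that may separate the tags.
NESTED = re.compile(
    r"((?:<blockquote>\s*)+)"    # group 1: run of opening tags
    r"([^<]*)"                   # group 2: text with no inner markup (assumption)
    r"((?:\s*</blockquote>)+)",  # group 3: run of closing tags
    re.IGNORECASE,
)


def collapse(match):
    """Replace one nested run with a single tag whose class records the depth."""
    opens = match.group(1).lower().count("<blockquote>")
    closes = match.group(3).lower().count("</blockquote>")
    if opens != closes:
        # Unbalanced run: leave it untouched rather than guess
        return match.group(0)
    text = match.group(2).strip()
    return f'<blockquote class="indent-{opens}">{text}</blockquote>'


def collapse_blockquotes(html):
    return NESTED.sub(collapse, html)


sample = (
    "<blockquote> <blockquote>roses are red</blockquote>\n</blockquote><br>"
    "<blockquote><blockquote><blockquote>violets are blue"
    "</blockquote></blockquote></blockquote><br>"
)
collapsed = collapse_blockquotes(sample)
print(collapsed)
```

Runs whose opening and closing counts differ are left as they were, so unusual files can be found and fixed by hand afterwards.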

2 answers

solution1
1 2020-03-12 15:13:45

solution2
-1 2020-03-09 09:14:41
