
How to replace nested blockquote tags with a single tag with a class?

I inherited several thousand messy HTML files that use repeated blockquote tags to display lines of poetry.

Example:

 <blockquote><blockquote>roses are red</blockquote></blockquote><br> <blockquote><blockquote><blockquote>violets are blue</blockquote></blockquote></blockquote><br> <blockquote><blockquote>this is another line</blockquote></blockquote><br> <blockquote><blockquote><blockquote>and this is too</blockquote></blockquote></blockquote><br>

For lines of free verse, you'll see as many as 7-8 blockquote tags wrapping a line of text. I want to replace each set of nested blockquote tags with a single <p> or <span> tag and give it a class such as “indent-7” or “indent-8.”

There is unpredictable white space between the blockquote tags. Some have spaces between them, some are separated by new lines. I'm thinking Python's BeautifulSoup is the way to handle this task.

How can I replace the nested blockquote tags with a single tag with a class of “n”, where n is the number of tags that were nested?

This is how I would approach it with lxml:

(Note that I added a line to the poem, to test for tags separated by space.)

import lxml.html

poem = """
<doc>
  <blockquote><blockquote>roses are red</blockquote></blockquote><br/>
  <blockquote>     <blockquote>roses are green</blockquote></blockquote><br/>
       <blockquote>
         <blockquote><blockquote>violets are blue</blockquote></blockquote></blockquote><br/>
    <blockquote><blockquote>this is another line</blockquote></blockquote><br/>
    <blockquote><blockquote><blockquote>and this is too</blockquote></blockquote></blockquote><br/>

</doc>
"""

doc = lxml.html.fromstring(poem)
# Select every text node that contains something other than whitespace.
targ = doc.xpath('//text()[normalize-space(.)]')
for t in targ:
    # Count the blockquote elements that wrap this text node.
    count = int(t.getparent().xpath("count(.//ancestor::*[name()='blockquote'])"))
    print(f'<blockquote indent="{count}">{t}</blockquote>')

Output:

<blockquote indent="2">roses are red</blockquote>
<blockquote indent="2">roses are green</blockquote>
<blockquote indent="3">violets are blue</blockquote>
<blockquote indent="2">this is another line</blockquote>
<blockquote indent="3">and this is too</blockquote>
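The question asked for a single <p> or <span> with a class such as "indent-2" rather than an attribute on a blockquote. As a rough sketch of the same depth-counting idea, the snippet below swaps lxml for html.parser from the Python standard library, so it is a different technique from the answer above. The names BlockquoteFlattener and flatten_blockquotes are made up for illustration. It assumes each line of verse is plain text: other markup and any text outside a blockquote are dropped, and inline tags inside a line would split it into several entries.

```python
from html import escape
from html.parser import HTMLParser


class BlockquoteFlattener(HTMLParser):
    """Collect each text run nested in N blockquotes as <p class="indent-N">."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # number of currently open <blockquote> tags
        self.lines = []   # converted lines, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Whitespace between tags carries no text, so it is skipped.
        if text and self.depth > 0:
            safe = escape(text, quote=False)
            self.lines.append(f'<p class="indent-{self.depth}">{safe}</p>')


def flatten_blockquotes(html):
    """Return the converted lines for one HTML string (hypothetical helper)."""
    parser = BlockquoteFlattener()
    parser.feed(html)
    parser.close()
    return parser.lines
```

Calling flatten_blockquotes on the contents of a file returns a list of strings that can be joined with newlines and written back out.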

Just for good measure (and for the benefit of future readers), this is how I would do it with XQuery:

let $j := <doc>
...text of poem above... 
</doc>

for $targ in $j//text()[normalize-space(.)] 

let $line := $targ/data(.) 
let $count := count($targ/ancestor::blockquote)
return 
<blockquote nested="{$count}">{$line}</blockquote>

Same output, except that the attribute is named nested instead of indent.

You could remove the whitespace and newlines manually. Once that's removed, the job should be easier.

Assuming that's not an option to consider, you can use PHP for that:

$html = preg_replace('~>\\s+<~m', '><', $html);
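For readers working in Python rather than PHP, the same step can be sketched with the standard re module. The function name collapse_tag_whitespace is hypothetical. As with the PHP line, this removes whitespace between any two adjacent tags, so it is only safe where that whitespace is not meaningful.

```python
import re


def collapse_tag_whitespace(html):
    """Remove whitespace (including newlines) that sits between two tags."""
    # \s already matches newlines, so no extra flags are needed.
    return re.sub(r">\s+<", "><", html)
```

Text inside a tag, such as the spaces in "roses are red", is left alone because it is not bounded by > and < on both sides.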

Now, to replace the blockquotes, you can even use Notepad++'s find/replace functionality; you just need to find a pattern. For instance, in the code you currently have, there are at most two or three nested blockquotes. So, in Notepad++, you'll need to do four find/replace-all operations. Do the three-tag searches first, otherwise the two-tag searches will also match inside the three-tag runs:

  • search for <blockquote><blockquote><blockquote> and replace with <p> (or <span>, as you prefer)
  • search for </blockquote></blockquote></blockquote> and replace with </p> (or </span>, as you prefer)
  • search for <blockquote><blockquote> and replace with <p> (or <span>, as you prefer)
  • search for </blockquote></blockquote> and replace with </p> (or </span>, as you prefer)
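The find/replace steps above cover one nesting depth per pair of operations, while the question mentions up to 7-8 levels and wants a class per depth. A possible generalisation, sketched here with Python's re module, matches a whole run of opening tags, the text, and the run of closing tags in one pass. The names RUN and replace_runs are made up for illustration, and the sketch assumes the blockquote tags carry no attributes and the line contains no inner markup; anything else is left unchanged.

```python
import re

# One or more opening tags, a run of plain text, one or more closing tags.
RUN = re.compile(
    r"((?:<blockquote>\s*)+)"     # opening tags, whitespace allowed between
    r"([^<]*)"                    # the line of text (no inner markup)
    r"((?:\s*</blockquote>)+)",   # closing tags, whitespace allowed between
    re.IGNORECASE,
)


def replace_runs(html, tag="p"):
    """Replace each balanced run of N blockquotes with <tag class="indent-N">."""

    def repl(match):
        opens = len(re.findall(r"<blockquote>", match.group(1), re.IGNORECASE))
        closes = len(re.findall(r"</blockquote>", match.group(3), re.IGNORECASE))
        if opens != closes:
            return match.group(0)  # unbalanced run: leave it untouched
        text = match.group(2).strip()
        return f'<{tag} class="indent-{opens}">{text}</{tag}>'

    return RUN.sub(repl, html)
```

Because whitespace between the tags is tolerated by the pattern itself, this does not depend on the whitespace having been stripped beforehand.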
