Perl non-greedy regular expression

Question

So I finally got my boss to approve the use of Perl for this purpose as opposed to sed.

Here's the basic quandary.

I have lines like this:

<div class="SectionText">Sometext</div><div class="SectionText">Some more text</div>

It's terribly messy, but I didn't write it. Either way, there are a goodly number of pages like this and they need to be changed to this format:

<p>Sometext</p><p>Some more text</p>

Obviously this needs to be non-greedy. Now here's the line I've come up with to help with this:

perl -nle "s/(.*)<div class=\"SectionText\">(.*?)<\/div>(.*)/\1<p>\2<\/p>\3/ig; print $1" "somefile.html" > otherfile.html

However, this does nothing and all of the SectionText tags still remain.

Answer 1

Be aware that regular expressions are far from ideal for processing HTML. The proper way is to use a parser and manipulate the DOM, but you can get away with regexes for simple and well-behaved situations. Just be aware, further down the line, that this is a weak point in your design and may cause unexpected problems.

There is no need to capture and restore text outside the area to be edited. Simply replace the <div> element with a <p> element with the same content. There is also no need to escape double quotes or slashes as long as you choose different delimiters.

It is also wrong to use \1, \2 etc. in the replacement string. $1, $2 etc. belong here, and you would have been warned about this if you had used -w on the command line.

This should work for you:

perl -pe 's|<div class="SectionText">(.*?)</div>|<p>$1</p>|ig' somefile.html > otherfile.html

Answer 2

See HTML::TreeBuilder::XPath, and HTML::Element for output methods.

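use HTML::TreeBuilder::XPath;
# Parse the sample markup from the question into a tree.
my $t = HTML::TreeBuilder::XPath
    ->new_from_content('<div class="SectionText">Sometext</div><div class="SectionText">Some more text</div>');
# Turn each matching <div> into a <p> and delete its class attribute.
for ($t->findnodes('//div[@class="SectionText"]')) {
    $_->tag('p');
    $_->attr(class => undef);
}
# as_HTML is one of the HTML::Element output methods.
print $t->as_HTML, "\n";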

To make it 100% correct, the class attribute value should be split on whitespace, the class name SectionText removed, and then the attribute value reassembled. I think in your case you can get away with just deleting the class attribute as in the code above.

Answer 1
6, accepted, 2012-03-22 13:38:12

Answer 2
4, 2012-03-22 13:44:44
