
Perl non-greedy Regex

So I finally got my boss to approve the use of Perl for this purpose instead of sed.

Here's the basic quandary.

I have lines like this:

<div class="SectionText">Sometext</div><div class="SectionText">Some more text</div>

It's terribly messy, but I didn't write it. Either way, there are a goodly number of pages like this and they need to be changed to this format:

<p>Sometext</p><p>Some more text</p>

Obviously this needs to be non-greedy. Now here's the line I've come up with to help with this:

perl -nle "s/(.*)<div class=\"SectionText\">(.*?)<\/div>(.*)/\1<p>\2<\/p>\3/ig; print $1" "somefile.html" > otherfile.html

However, this does nothing and all of the SectionText tags still remain.

Be aware that regular expressions are far from ideal for processing HTML. The proper way is to use a parser and manipulate the DOM, but you can get away with regexes for simple and well-behaved situations. Just keep in mind further down the line that this is a weak point in your design and may cause unexpected problems.

There is no need to capture and restore text outside the area to be edited. Simply replace the <div> element with a <p> element with the same content. There is also no need to escape double quotes or slashes as long as you choose different delimiters.

It is also wrong to use \1, \2 etc. in the replacement string. $1, $2 etc. belong here, and you would have been warned of this if you had used -w on the command line.

This should work for you

perl -pe 's|<div class="SectionText">(.*?)</div>|<p>$1</p>|ig' somefile.html > otherfile.html

See HTML::TreeBuilder::XPath, and HTML::Element for output methods.

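use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Parse the sample markup into a tree
my $t = HTML::TreeBuilder::XPath
    ->new_from_content('<div class="SectionText">Sometext</div><div class="SectionText">Some more text</div>');
# Turn every <div class="SectionText"> into a bare <p>
for ($t->findnodes('//div[@class="SectionText"]')) {
    $_->tag('p');                # rename the element
    $_->attr(class => undef);    # setting an attribute to undef deletes it
}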

To make it 100% correct, the class attribute value should be split on whitespace, the class name SectionText removed, and then the attribute value reassembled. I think in your case you can get away with just deleting the class attribute as in the code above.
