
Replace strings only within a regex match in Perl

I have an XML document with text in attribute values. I can't change how the XML file is generated, but I need to extract the attribute values without losing the \r\n line breaks. The XML parser, of course, strips them out.

So I'm trying to replace \r\n in attribute values with entity references. I'm using Perl to do this because of its non-greedy matching. But I need help getting the replacement to happen only within the match. Or I need an easier way to do this :)

Here is what I have so far:

perl -i -pe 'BEGIN{undef $/;} s/m_description="(.*?)"/m_description="$1"/smg' tmp.xml

This matches what I need to work with: (.*?). But I don't know how to expand that pattern to match \r\n inside it and do the replacement in the results. If I knew how many \r\n sequences there were I could do it, but it seems I need a variable number of capture groups or something like that? There's a lot about regex I don't understand, and it seems like there should be something to do this.

Example:

preceding lines 
stuff m_description="Over
any number
of lines" other stuff
more lines

Should go to:

preceding lines 
stuff m_description="Over&#13;&#10;any number&#13;&#10;of lines" other stuff
more lines

Solution

Thanks to Ikegam and ysth for the solution I used, which for 5.14+ is:

perl -i -0777 -pe's/m_description="\K(.*?)(?=")/ $1 =~ s!\n!&#10;!gr =~ s!\r!&#13;!gr /sge' tmp.xml
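For readers who prefer a script to a one-liner, here is a minimal sketch of the same substitution wrapped in a sub. The helper name encode_description_breaks and the sample input are made up for illustration; the regex itself follows the one-liner above.

```perl
use strict;
use warnings;
use 5.014;    # the /r (non-destructive substitution) flag needs Perl 5.14+

# Hypothetical helper: encode line breaks inside every m_description="..." value.
sub encode_description_breaks {
    my ($xml) = @_;
    # \K keeps the attribute name and opening quote out of the replaced text;
    # (?=") stops the lazy match at the next double quote without consuming it.
    # /e evaluates the replacement as code, /r returns the result instead of
    # modifying $xml in place.
    return $xml =~ s{m_description="\K(.*?)(?=")}
                    { $1 =~ s/\n/&#10;/gr =~ s/\r/&#13;/gr }sger;
}

# Illustrative input only: CRLF line breaks inside and outside the attribute.
my $input = qq{stuff m_description="Over\r\nany number\r\nof lines" other stuff\r\nmore lines\r\n};
print encode_description_breaks($input);
```

Only the line breaks between the quotes are encoded; the ones after "other stuff" and "more lines" are left alone.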

. should already match \n (because you specify the /s flag) and \r.
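A quick way to convince yourself of this is the small check below. The sample string is invented for illustration; the behavior of . with and without /s is standard Perl.

```perl
use strict;
use warnings;

# Illustrative value containing a CRLF line break.
my $value = "Over\r\nany number";

# Without /s, "." refuses to match "\n", so the match cannot span lines.
my $without_s = $value =~ /Over.*number/  ? 1 : 0;

# With /s, "." matches any character, including "\n".
# ("\r" is matched by "." either way.)
my $with_s    = $value =~ /Over.*number/s ? 1 : 0;

print "without /s: $without_s, with /s: $with_s\n";   # without /s: 0, with /s: 1
```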

To do the replacement in the results, use /e:

perl -i -0777 -pe's/(?<=m_description=")(.*?)(?=")/ my $replacement=$1; $replacement=~s!\n!&#10;!g; $replacement=~s!\r!&#13;!g; $replacement /sge' tmp.xml

I've also changed it to use lookbehind/lookahead to make the code simpler, to use -0777 to set $/ to slurp mode, and to remove the useless /m (/m only changes what ^ and $ match, and the pattern uses neither).
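Slurp mode matters here because a regex can only match across lines if the whole file arrives in a single read. The sketch below shows the difference using an in-memory filehandle; the sample text and variable names are invented for illustration.

```perl
use strict;
use warnings;

# Illustrative stand-in for the contents of a file.
my $file_contents = "line one\nline two\nline three\n";

# Default: $/ is "\n", so each read returns one line.
open my $by_line, '<', \$file_contents or die "open: $!";
my @lines = <$by_line>;
close $by_line;

# Slurp mode: with $/ undefined (which is what -0777 arranges),
# a single read returns the whole input.
open my $slurp, '<', \$file_contents or die "open: $!";
my $whole = do { local $/; <$slurp> };
close $slurp;

printf "line reads: %d, slurped length: %d\n", scalar @lines, length $whole;
```

With line-by-line reads, an m_description value that spans several lines would never be seen in one piece, so the substitution could not match it.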

OK, so whilst this looks like an XML problem, it isn't. The XML problem is the person generating it. You should probably give them a prod with a rolled-up copy of the spec as your first port of call for "fixing" this.

But failing that, I'd take a two-pass approach: read the text, find all the 'blobs' that match a description, and then replace them all.

Something like this:

#!/usr/bin/env perl

use strict;
use warnings;

use Data::Dumper;

my $text = do { local $/ ;  <DATA> }; 

#filter text for 'description' text: 
my @matches = $text =~ m{m_description=\"([^\"]+)\"}gms;

print Dumper \@matches; 

#Generate a search-and-replace hash
my %replace = map { $_ => s/[\r\n]+/&#13;&#10;/gr } @matches; 
print Dumper \%replace;

#turn the keys of that hash into a search regex
#(quotemeta so that regex metacharacters in a description are matched literally)
my $search = join ( "|", map { quotemeta } keys %replace ); 
   $search = qr/\"($search)\"/ms; 

print "Using search regex: $search\n";
#search and replace text block
$text =~ s/m_description=$search/m_description="$replace{$1}"/mgs;

print "New text:\n";
print $text;

__DATA__
preceding lines 
stuff m_description="Over
any number
of lines" other stuff
more lines
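One thing to be aware of in the script above: [\r\n]+ turns any run of CR/LF characters into a single entity pair, so an empty line inside a description is lost. The sketch below compares that with encoding each character separately; whether the second behavior is what you want is an assumption, and the sample value is invented.

```perl
use strict;
use warnings;
use 5.014;    # for the /r flag and say

# Illustrative value that contains an empty line.
my $value = "Over\r\n\r\nany number";

# As in the script above: a whole run of CR/LF characters becomes one pair.
my $collapsed = $value =~ s/[\r\n]+/&#13;&#10;/gr;

# Alternative (assumed requirement): encode every CR and LF individually.
my $per_char = $value =~ s/\r/&#13;/gr =~ s/\n/&#10;/gr;

say $collapsed;   # Over&#13;&#10;any number
say $per_char;    # Over&#13;&#10;&#13;&#10;any number
```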
