Regex Group in Perl: how to capture elements into array from regex group that matches unknown number of/multiple/variable occurrences from a string?
In Perl, how can I use one regex grouping to capture more than one occurrence that matches it, into several array elements?
For example, for a string:
var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello
to process this with code:
$string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";
my @array = $string =~ <regular expression here>
for ( my $i = 0; $i < scalar( @array ); $i++ )
{
print $i.": ".$array[$i]."\n";
}
I would like to see as output:
0: var1=100
1: var2=90
2: var5=hello
3: var3="a, b, c"
4: var7=test
5: var3=hello
What would I use as a regex?
The commonality between the things I want to match here is an assignment string pattern, so something like:
my @array = $string =~ m/(\w+=[\w\"\,\s]+)*/;
Where the * is meant to indicate repeated occurrences matching the group (strictly speaking, * means zero or more).
(I ruled out using split(), as some matches contain spaces within themselves (i.e. var3...) and it would therefore not give the desired results.)
With the above regex, I only get:
0: var1=100 var2
Is this possible in a regex, or is additional code required?
I have already looked at existing answers when searching for "perl regex multiple group", but did not find enough clues.
my $string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";
while($string =~ /(?:^|\s+)(\S+)\s*=\s*("[^"]*"|\S*)/g) {
print "<$1> => <$2>\n";
}
Prints:
<var1> => <100>
<var2> => <90>
<var5> => <hello>
<var3> => <"a, b, c">
<var7> => <test>
<var3> => <hello>
Explanation:
Last piece first: the g flag at the end means that you can apply the regex to the string multiple times. Each subsequent attempt continues matching where the previous match ended in the string.
Now for the regex: (?:^|\s+) matches either the beginning of the string or a run of one or more whitespace characters. This is needed so that when the regex is applied the next time, it skips the whitespace between the key/value pairs. The ?: means that the contents of the parentheses won't be captured as a group (we don't need the spaces, only the key and value). \S+ matches the variable name. Then we skip any amount of whitespace and the equal sign in between. Finally, ("[^"]*"|\S*) matches either two quotes with any number of characters in between, or any number of non-space characters for the value. Note that the quote matching is pretty fragile and won't handle escaped quotes properly, e.g. "\"quoted\"" would result in "\".
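The escaped-quote caveat above can be reproduced with a short sketch. The second pattern is not from the answer; it is one common way to let a quoted value contain backslash escapes, shown here as an assumption about what "handling escaped quotes" would look like. The variable names are illustrative.

```perl
use strict;
use warnings;

# Single-quoted so the backslashes stay literal: k="\"quoted\""
my $input = 'k="\"quoted\""';

# Value pattern from the answer: a quoted run stops at the first double quote.
my ($naive) = $input =~ /(?:^|\s+)\S+\s*=\s*("[^"]*"|\S*)/;

# Hypothetical variant: inside quotes, accept either a non-quote,
# non-backslash character or a backslash followed by any character.
my ($escaped) = $input =~ /(?:^|\s+)\S+\s*=\s*("(?:[^"\\]|\\.)*"|\S*)/;

print "naive:   $naive\n";      # the truncated value "\"
print "escaped: $escaped\n";    # the full value "\"quoted\""
```

The variant only understands backslash escapes; doubled quotes or other quoting conventions would need a different pattern.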
EDIT:
Since you really want to get the whole assignment, and not the individual keys/values, here's a one-liner that extracts those:
my @list = $string =~ /(?:^|\s+)((?:\S+)\s*=\s*(?:"[^"]*"|\S*))/g;
With regular expressions, use a technique that I like to call tack-and-stretch: anchor on features you know will be there (tack) and then grab what's between (stretch).
In this case, you know that a single assignment matches
\b\w+=.+
and you have many of these repeated in $string. Remember that \b means word boundary:
A word boundary (\b) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W.
The values in the assignments can be a little tricky to describe with a regular expression, but you also know that each value will terminate with whitespace (although not necessarily the first whitespace encountered!) followed by either another assignment or end-of-string.
To avoid repeating the assignment pattern, compile it once with qr// and reuse it in your pattern along with a look-ahead assertion (?=...) to stretch the match just far enough to capture the entire value while also preventing it from spilling into the next variable name.
Matching against your pattern in list context with m//g gives the following behavior:
The /g modifier specifies global pattern matching, that is, matching as many times as possible within the string. How it behaves depends on the context. In list context, it returns a list of the substrings matched by any capturing parentheses in the regular expression. If there are no parentheses, it returns a list of all the matched strings, as if there were parentheses around the whole pattern.
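The two list-context behaviours described in the quoted documentation can be seen side by side in a minimal sketch. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = 'a1 b2 c3';

# No capturing parentheses: m//g returns each whole match.
my @whole = $text =~ /\w\d/g;

# Two capturing groups: m//g returns every captured substring, flattened.
my @parts = $text =~ /(\w)(\d)/g;

print "whole: @whole\n";    # whole: a1 b2 c3
print "parts: @parts\n";    # parts: a 1 b 2 c 3
```

This is why a pattern meant to return whole assignments should either have no capturing groups at all or exactly one group around the whole assignment.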
The pattern $assignment uses non-greedy .+? to cut off the value as soon as the look-ahead sees another assignment or end-of-line. Remember that the match returns the substrings from all capturing subpatterns, so the look-ahead's alternation uses non-capturing (?:...). The interpolated qr// adds no capturing parentheses either (it is wrapped in a non-capturing group), so the overall pattern has none and m//g returns each whole match.
#! /usr/bin/perl
use warnings;
use strict;
my $string = <<'EOF';
var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello
EOF
my $assignment = qr/\b\w+ = .+?/x;
my @array = $string =~ /$assignment (?= \s+ (?: $ | $assignment))/gx;
for ( my $i = 0; $i < scalar( @array ); $i++ )
{
print $i.": ".$array[$i]."\n";
}
Output:
0: var1=100
1: var2=90
2: var5=hello
3: var3="a, b, c"
4: var7=test
5: var3=hello
I'm not saying this is what you should do, but what you're trying to do is write a grammar. Your example is very simple for a grammar, but Damian Conway's module Regexp::Grammars is really great at this. If you have to grow this at all, you'll find it will make your life much easier. I use it quite a bit here; it is kind of perl6-ish.
use Regexp::Grammars;
use Data::Dumper;
use strict;
use warnings;
my $parser = qr{
<[pair]>+
<rule: pair> <key>=(?:"<list>"|<value=literal>)
<token: key> var\d+
<rule: list> <[MATCH=literal]> ** (,)
<token: literal> \S+
}xms;
q[var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello] =~ $parser;
die Dumper {%/};
Output:
$VAR1 = {
'' => 'var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello',
'pair' => [
{
'' => 'var1=100',
'value' => '100',
'key' => 'var1'
},
{
'' => 'var2=90',
'value' => '90',
'key' => 'var2'
},
{
'' => 'var5=hello',
'value' => 'hello',
'key' => 'var5'
},
{
'' => 'var3="a, b, c"',
'key' => 'var3',
'list' => [
'a',
'b',
'c'
]
},
{
'' => 'var7=test',
'value' => 'test',
'key' => 'var7'
},
{
'' => 'var3=hello',
'value' => 'hello',
'key' => 'var3'
}
]
};
A bit over the top maybe, but an excuse for me to look into http://p3rl.org/Parse::RecDescent. How about making a parser?
#!/usr/bin/perl
use strict;
use warnings;
use Parse::RecDescent;
use Regexp::Common;
my $grammar = <<'_EOGRAMMAR_'
INTEGER: /[-+]?\d+/
STRING: /\S+/
QSTRING: /$Regexp::Common::RE{quoted}/
VARIABLE: /var\d+/
VALUE: ( QSTRING | STRING | INTEGER )
assignment: VARIABLE "=" VALUE /[\s]*/ { print "$item{VARIABLE} => $item{VALUE}\n"; }
startrule: assignment(s)
_EOGRAMMAR_
;
$Parse::RecDescent::skip = '';
my $parser = Parse::RecDescent->new($grammar);
my $code = q{var1=100 var2=90 var5=hello var3="a, b, c" var7=test var8=" haha \" heh " var3=hello};
$parser->startrule($code);
yields:
var1 => 100
var2 => 90
var5 => hello
var3 => "a, b, c"
var7 => test
var8 => " haha \" heh "
var3 => hello
PS. Note the double var3. If you want the latter assignment to overwrite the first one, you can use a hash to store the values and then use them later.
PPS. My first thought was to split on '=', but that would fail if a string contained '='. And although regexps are almost always bad for parsing, I ended up trying it out, and it works.
Edit: Added support for escaped quotes inside quoted strings.
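The hash idea from the PS can be sketched without the parser. This is a minimal illustration, not the answer's own code: it assumes a simple value pattern (a double-quoted run without escapes, or a run of non-space characters), and the hash name %value_of is made up.

```perl
use strict;
use warnings;

my $string = 'var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello';

my %value_of;
while ($string =~ /(\w+)=("[^"]*"|\S+)/g) {
    # Assigning by key means a later var3 replaces the earlier one.
    $value_of{$1} = $2;
}

print "$_ => $value_of{$_}\n" for sort keys %value_of;
```

If the first assignment should win instead, replacing the assignment with `$value_of{$1} //= $2;` would keep the earliest value.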
I've recently had to parse x509 certificate "Subject" lines. They had a similar form to the one you have provided:
echo 'Subject: C=HU, L=Budapest, O=Microsec Ltd., CN=Microsec e-Szigno Root CA 2009/emailAddress=info@e-szigno.hu' | \
perl -wne 'my @a = m/(\w+\=.+?)(?=(?:, \w+\=|$))/g; print "$_\n" foreach @a;'
C=HU
L=Budapest
O=Microsec Ltd.
CN=Microsec e-Szigno Root CA 2009/emailAddress=info@e-szigno.hu
Short description of the regex:
(\w+\=.+?) - captures a word followed by '=' and any subsequent symbols in non-greedy mode
(?=(?:, \w+\=|$)) - which are followed by either another ", KEY=val" or end of line.
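To see what the look-ahead contributes, here is a small sketch comparing the same non-greedy capture with and without it. The shortened subject string and the variable names are chosen for illustration only.

```perl
use strict;
use warnings;

my $subject = 'C=HU, L=Budapest, O=Microsec Ltd.';

# Non-greedy .+? on its own is satisfied by a single character.
my @short = $subject =~ /(\w+=.+?)/g;

# The look-ahead forces .+? to stretch up to the next ", KEY=" or the end.
my @full = $subject =~ /(\w+=.+?)(?=, \w+=|$)/g;

print join('|', @short), "\n";    # C=H|L=B|O=M
print join('|', @full),  "\n";    # C=HU|L=Budapest|O=Microsec Ltd.
```

Because the look-ahead is zero-width, the ", " separator is not consumed and does not appear in the captured values.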
The interesting parts of the regex used are:
.+? - non-greedy mode
(?:pattern) - non-capturing mode
(?=pattern) - zero-width positive look-ahead assertion
This one also handles common backslash escaping inside double quotes, for example var3="a, \"b, c":
@a = /(\w+=(?:\w+|"(?:[^\\"]*(?:\\.[^\\"]*)*)*"))/g;
In action:
echo 'var1=100 var2=90 var42="foo\"bar\\" var5=hello var3="a, b, c" var7=test var3=hello' |
perl -nle '@a = /(\w+=(?:\w+|"(?:[^\\"]*(?:\\.[^\\"]*)*)*"))/g; $,=","; print @a'
var1=100,var2=90,var42="foo\"bar\\",var5=hello,var3="a, b, c",var7=test,var3=hello
#!/usr/bin/perl
use strict; use warnings;
use Text::ParseWords;
use YAML;
my $string =
"var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";
my @parts = shellwords $string;
print Dump \@parts;
@parts = map { { split /=/ } } @parts;
print Dump \@parts;
You asked for a regex solution or other code. Here is a (mostly) non-regex solution using only core modules. The only regex is \s+, which determines the delimiter; in this case, one or more whitespace characters.
use strict; use warnings;
use Text::ParseWords;
my $string="var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";
my @array = quotewords('\s+', 0, $string);
for ( my $i = 0; $i < scalar( @array ); $i++ )
{
print $i.": ".$array[$i]."\n";
}
The output is: 输出是:
0: var1=100
1: var2=90
2: var5=hello
3: var3=a, b, c
4: var7=test
5: var3=hello
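Note that the output above has lost the quotes around the var3 value. If the quotes should be preserved, as in the output the question asks for, the second argument of quotewords (the keep flag) can be set to a true value. A minimal sketch, with @kept as an illustrative name:

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

my $string = 'var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello';

# keep = 1: split on the delimiter but leave quotes and backslashes in the tokens.
my @kept = quotewords('\s+', 1, $string);

print "$_: $kept[$_]\n" for 0 .. $#kept;
```

With keep set, backslashes are also left in place, so any escapes inside the values come back unprocessed.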
If you really want a regex solution, Alan Moore's comment linking to his code on IDEone is the gas!
It is possible to do this with regexes; however, it's fragile.
my $string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";
my $regexp = qr/( (?:\w+=[\w\,]+) | (?:\w+=\"[^\"]*\") )/x;
my @matches = $string =~ /$regexp/g;