Issue with Grep Perl Non-Greedy Scope RegEx Matching on Empty String

All:

As the subject states, I'm running into an issue with Grep Perl Non-Greedy Scope RegEx Matching on an Empty String.

[Note: For the purposes of this example, assume that the 'title' can be a complex, alphanumeric, special-character, multi-word, space-separated string.]

# echo "<span class=\"title\"></span><span class=\"price\">0.25</span><span class=\"title\">Banana</span><span class=\"price\">0.10</span><span class=\"title\">Grape</span><span class=\"price\">0.05</span>" | /opt/bin/grep -ioP "<span class=\"title\">(.+?)</span><span class=\"price\">(.+?)</span>" | sed "s/<span class=\"title\">//g; s/<span class=\"price\">/|/g; s/<\/span>//g;"
|0.25Banana|0.10
Grape|0.05

As you can see, the first 'title' is empty, but the grep Perl non-greedy scope regex (.+?) still matches.

Shouldn't the first 'title' match be ignored? What am I missing?

Thank you for your assistance.

UPDATE:

Negating the less-than sign ([^<]+?) is a good solution for the original, basic example. However, I'm finding that it runs into problems when more data is introduced.

I've attempted to expand the match to include additional trailing tags, but the regex still appears to fail with that change as well.

# echo "<span class=\"title\"></span></div></div><span class=\"price\">0.25</span><span class=\"title\">Banana</span></div></a><span class=\"price\">0.10</span><span class=\"title\">Grape</span></div></a><span class=\"price\">0.05</span>" | grep -ioP "<span class=\"title\">(.+?)</span></div></a><span class=\"price\">(.+?)</span>" | sed "s/<span class=\"title\">//g; s/<span class=\"price\">/|/g; s/<\/span>//g; s/<\/div>//g; s/<\/a>//g;"
|0.25Banana|0.10
Grape|0.05

Shouldn't the regex match on the </span></div></a> tags, but not on the </span></div></div> tags?

Thanks, again, for your time and assistance.

Your chosen regular expression <span class="title">(.+?)</span> requires at least one character inside the title tag. For an empty title, the regex therefore keeps capturing from that position, past the empty tag's own </span>, until a later closing </span> tag lets the whole pattern match, which is definitely not what you intended to achieve.
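The same effect can be reproduced on a much smaller input. The following is only a sketch: the bracketed sample string and the variable names are made up for illustration, and it assumes a grep built with -P (PCRE) support, as in the question.

```shell
# Hypothetical minimal input: an empty pair of brackets followed by a non-empty pair.
sample='[][x]'

# .+? must consume at least one character, so the match that starts at the
# first "[" swallows "][x" before it is allowed to end on a "]".
lazy=$(printf '%s\n' "$sample" | grep -oP '\[.+?\]')

# A negated class cannot cross a "]", so the empty pair is skipped entirely.
negated=$(printf '%s\n' "$sample" | grep -oP '\[[^\]]+\]')

printf 'lazy:    %s\nnegated: %s\n' "$lazy" "$negated"
```

Non-greedy only means "expand as little as needed from the current starting position"; it does not make the engine give up on the leftmost possible start.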

Perhaps the following code is self-explanatory:

use strict;
use warnings;

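# .+? needs at least one character, so an empty title is not skipped
my $re = qr!<span class="title">(.+?)</span><span class="price">(.*?)</span>!;

# Slurp the whole DATA section into one string
my $input = do { local $/; <DATA> };
# In list context, /g returns ($1, $2) for every match: title => price pairs
my %data = $input =~ /$re/g;

for my $k ( sort keys %data ) {
    printf "| %-10s | %6.2f |\n", $k, $data{$k};
}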

__DATA__
<span class="title"></span><span class="price">0.25</span><span class="title">Banana</span><span class="price">0.10</span><span class="title">Grape</span><span class="price">0.05</span>

Output

| </span><span class="price">0.25</span><span class="title">Banana |   0.10 |
| Grape      |   0.05 |

Perhaps you intended to use the following regular expression:

use strict;
use warnings;

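# [^<]+? cannot cross a tag boundary, so the empty title fails to match
my $re = qr!<span class="title">([^<]+?)</span><span class="price">(.*?)</span>!;

# Slurp the whole DATA section into one string
my $input = do { local $/; <DATA> };
# In list context, /g returns ($1, $2) for every match: title => price pairs
my %data = $input =~ /$re/g;

for my $k ( sort keys %data ) {
    printf "| %-10s | %6.2f |\n", $k, $data{$k};
}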

__DATA__
<span class="title"></span><span class="price">0.25</span><span class="title">Banana</span><span class="price">0.10</span><span class="title">Grape</span><span class="price">0.05</span>

Output

| Banana     |   0.10 |
| Grape      |   0.05 |

So, if you choose to stay with grep and sed, the command might take the following shape:

echo "<span class=\"title\"></span><span class=\"price\">0.25</span><span class=\"title\">Banana</span><span class=\"price\">0.10</span><span class=\"title\">Grape</span><span class=\"price\">0.05</span>" | grep -ioP "<span class=\"title\">([^<]+?)</span><span class=\"price\">(.+?)</span>" | sed "s/<span class=\"title\">//g; s/<span class=\"price\">/|/g; s/<\/span>//g;"

Output

Banana|0.10
Grape|0.05
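The expanded input from the question's update can be handled in the same spirit. The sketch below is one possible approach, not the only one: it swaps the negated class for a tempered pattern, (?:(?!</span>).)+, which would also tolerate a stray "<" inside a title. The variable names are illustrative, and grep is assumed to support -P.

```shell
# Expanded sample from the question's update (empty first title, extra closing tags).
input='<span class="title"></span></div></div><span class="price">0.25</span><span class="title">Banana</span></div></a><span class="price">0.10</span><span class="title">Grape</span></div></a><span class="price">0.05</span>'

# One or more characters, none of which begins a closing </span> tag.
text='(?:(?!</span>).)+'
pattern="<span class=\"title\">${text}</span></div></a><span class=\"price\">${text}</span>"

result=$(printf '%s\n' "$input" \
  | grep -oP "$pattern" \
  | sed 's/<span class="title">//g; s/<span class="price">/|/g; s/<\/span>//g; s/<\/div>//g; s/<\/a>//g')

printf '%s\n' "$result"
```

Because the title group can no longer run past its own </span>, the empty title followed by </span></div></div> finds no match at all, and only the Banana and Grape entries are printed.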

If Perl is available on your system, it would perhaps be easier to make use of its power.

@PolarBear Success! With your guidance, I finally figured out the optimal solution for my particular issue while still making use of the original non-greedy scope regex match (.+?): I included additional leading tags that uniquely identify the specific groups I was targeting, which excludes those that do not match. I appreciate your assistance and positive feedback.

