
Perl RegEx to get substring of word found between two tags

I have a question related to regex. I have a string $str1 = <strong>average_speed_answer_good_high</strong>. What I am trying to do is get the part before "_good_high" (in this case "average_speed_answer") into a variable $sub_str1, and "good_high" into a variable $sub_str2.

Here "_good_high" is the only constant part of the string; the rest can change. There could even be some characters after "_good_high" and before "</strong>". Can I get some tips on how to do this?

So far, I have been able to do something like:

if ( $str1 =~ m{(<strong>)(.*?)(</strong>)} ) {
    $sub_str1 = $2; #which gives average_speed_answer_good_high
}

I have tried some combinations like:

(<strong>)(?=_good_high)(</strong>) 
(<strong>)(?<=_good_high)(</strong>) 
(<strong>)((?<=_good_high)\w+)(</strong>) #tried $2 and $3
(<strong>)(?<=_good_high)\w+(</strong>) 
(<strong>)((?<=(_good_high))\w+)(</strong>)#tried $2, $3 and $4

but they all leave $sub_str1 blank.

I would appreciate any help or tips.

You need to specify _good_high before the closing strong tag.

if ( $str1 =~ m{(<strong>)(.*?)_good_high.*?(</strong>)} ) {
    $sub_str1 = $2; 
}

or

if ( $str1 =~ m{<strong>(.*?)_good_high.*?</strong>} ) {
    $sub_str1 = $1; 
}
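
The question also asks for "good_high" in $sub_str2 and allows extra characters before </strong>. A minimal sketch that extends the pattern above with a second capture group might look like this; the sub name split_tagged and the sample string are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (prefix, "good_high"), or an empty list
# when the string does not match.
sub split_tagged {
    my ($str) = @_;
    # Group 1: shortest text before "_good_high".
    # Group 2: the constant part, without the leading underscore.
    # [^<]* skips any characters between "good_high" and the closing tag.
    if ( $str =~ m{<strong>(.*?)_(good_high)[^<]*</strong>} ) {
        return ( $1, $2 );
    }
    return;
}

my ( $sub_str1, $sub_str2 ) =
    split_tagged('<strong>average_speed_answer_good_high_x1</strong>');
print "$sub_str1 | $sub_str2\n";
```

Capturing a constant may look redundant, but it keeps both values coming out of the same match, so they cannot get out of sync if the pattern is edited later.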

How about:

($sub_str1) = $str1 =~ m{<strong>(.*?)_good_high</strong>};

Don't get too hung up on regexes and capture groups. They're not the only tool in your box.

For example:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $str1 = '<strong>average_speed_answer_good_high</strong>';
if ( my ($sub_str1) = $str1 =~ m{<strong>(.*?)</strong>} ) {
    print "Substr: $sub_str1\n";
    my @split_str = split ( /_/, $sub_str1 );
    print Dumper \@split_str; 
    print "Extracted: ",join ( "_", (split ( /_/, $sub_str1 ))[0..2] ),"\n";
}

We extract the substring as before, but then we split it on _ :

$VAR1 = [
          'average',
          'speed',
          'answer',
          'good',
          'high'
        ];

And then stick it together again, preserving elements 0 to 2 to get your answer.
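
The slice [0..2] assumes the prefix always has exactly three words. If the prefix length can vary while the suffix is always the two words good and high, one possible variation is to drop the last two elements instead. This is only a sketch: the helper name strip_suffix_words and the second sample string are hypothetical, and it assumes nothing follows "_good_high" inside the tag.

```perl
use strict;
use warnings;

# Hypothetical helper: split on "_" and drop the last $n words.
sub strip_suffix_words {
    my ( $text, $n ) = @_;
    my @words = split /_/, $text;
    return if @words <= $n;    # nothing would be left to keep
    return join '_', @words[ 0 .. $#words - $n ];
}

print strip_suffix_words( 'average_speed_answer_good_high', 2 ), "\n";
print strip_suffix_words( 'max_load_good_high', 2 ), "\n";
```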

Your problems seem to result from how you understand the functioning of ( and ), ?, .*, and .*?.

In your second set of examples, there is no variable part, only grouping, sometimes without capturing.

  • pre(.*)post captures everything between pre and post in $1
  • pre(?:a|b|c)post groups the alternatives without capturing
  • a(.*?)b matches non-greedily (and captures): in axbyb it matches x instead of xby
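
The three behaviours in the list above can be checked with a short script. This is an illustrative sketch; the string 'pre_b_post42' is invented for the demonstration.

```perl
use strict;
use warnings;

my $s = 'axbyb';

my ($greedy) = $s =~ /a(.*)b/;     # .* takes as much as possible
my ($lazy)   = $s =~ /a(.*?)b/;    # .*? takes as little as possible
print "greedy: $greedy\n";         # greedy: xby
print "lazy:   $lazy\n";           # lazy:   x

# (?:...) groups the alternatives but does not create a capture group,
# so the only capture here is the trailing (\d+).
my ($num) = 'pre_b_post42' =~ /pre_(?:a|b|c)_post(\d+)/;
print "num: $num\n";               # num: 42
```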

I think the best way is as follows. Just look for all text, excluding angle brackets, that is preceded by a <strong> tag and followed by _good_high (there's no need to search for the end tag). That is the wanted substring.

use strict;
use warnings;

my $s = <<END;
<html>
  <body>
    <strong>average_speed_answer_good_high</strong>
  </body>
</html>
END

if ( my ($text) = $s =~ /<strong>([^<>]+)_good_high/ ) {
    print $text, "\n";
}

output

average_speed_answer
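
For comparison with the lookaround attempts in the question: a (?<=_good_high) placed directly after <strong> asserts that the text just before that position is _good_high, which cannot be true there, and a (?=_good_high) placed directly before </strong> asserts that the next characters are _good_high while the pattern then requires them to be </strong>. A sketch of the same "preceded by, followed by" idea with the assertions in workable positions could look like this:

```perl
use strict;
use warnings;

my $s = '<strong>average_speed_answer_good_high</strong>';

# (?<=<strong>)  : the match must be preceded by the opening tag
# ([^<>]+?)      : shortest run of non-bracket characters, captured in $1
# (?=_good_high) : the match must be followed by the constant suffix
my ($text) = $s =~ /(?<=<strong>)([^<>]+?)(?=_good_high)/;
print "$text\n" if defined $text;
```

Neither assertion consumes any characters, so the overall match is just the wanted prefix; the plain capture-group version above gives the same result and is arguably easier to read.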
