Regular expression help in Perl

I have the following text pattern:

(2222) First Last (ab-cd/ABC1), <first.last@site.domain.com> 1224: efadsfadsfdsf

(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf

I want the numbers that come after the > in the text above: 1224 in the first line, and 1234, 4657 in the second.

I have this regex:

\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\@\w*\.domain.com>\s\d+:

It matches everything up to the colon, but I want only the part between the email address and the colon.

Is there an easy regular expression to do this, or should I use split instead?

Thanks

Edit: The whole text is returned by a command line tool.

(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf

(3333) - Unique ID

First Last - First and last names

<first.last@site.domain.com> - Email address in the format FirstName.LastName@sub.domain.com

1234, 4657 - Database primary keys

: xxxx - Headline

What I have to do is process the above, get the database IDs (in this example 1234 and 4657, two separate IDs), and query the tables.

The above is the output of a tool that I call from my Perl script; I will get many entries like this.

My idea was to use a regular expression to get the database IDs.

You can fudge the stuff you don't care about to make the expression easier, say by just 'globbing' the parts between the parentheticals (and the email delimiters) using non-greedy quantifiers:

/(\d+)\).*?\(.*?\),\s*<.*?>\s*(\d+(?:,\s*\d+)*):/   (not tested!)

There are only two capture groups: the leading ID in parentheses, and the number list (for example 1234, 4657). The second one I can only assume from your pattern to mean "a digit string, followed by zero or more comma-separated digit strings".
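One way to check that pattern against the two sample lines from the question is sketched below. The helper name parse_entry is made up for illustration, and splitting the second capture on commas is an assumption about how the IDs will be used; the regex itself is the one shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (unique_id, [database ids]),
# or an empty list when the line does not match.
sub parse_entry {
    my ($line) = @_;
    if ($line =~ /(\d+)\).*?\(.*?\),\s*<.*?>\s*(\d+(?:,\s*\d+)*):/) {
        my ($id, $list) = ($1, $2);
        return ($id, [ split /,\s*/, $list ]);
    }
    return;
}

my @samples = (
    '(2222) First Last (ab-cd/ABC1), <first.last@site.domain.com> 1224: efadsfadsfdsf',
    '(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf',
);

for my $line (@samples) {
    my ($id, $keys) = parse_entry($line) or next;   # skip lines that do not match
    print "entry $id: @$keys\n";
}
```

Because the uninteresting parts are matched with non-greedy .*?, the hyphen in ab-cd/ABC1 needs no special handling.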

Well, a simple fix is to allow all the possible characters in a character class. That is, change the last \d to [\d, ] to allow digits, commas, and spaces.

Your regex as it is, though, does not match the first sample line, because that line has a dash in it (ab-cd/ABC1 does not match \w*\/\w+\d*). Also, it is not a good idea to rely too heavily on the * quantifier, because it also matches the empty string (it matches zero or more times), so it should only be used for things that are truly optional. Otherwise use +, which matches one or more times.

You have a rather strict regex, and it will fail on slight variations in your data like this one. Only you know what your data looks like and whether you actually need a strict regex. However, if your data is somewhat consistent, you can use a loose regex based simply on the email part:
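To make both points concrete, here is a small sketch comparing the question's pattern with a relaxed variant. The variant is only one possible adjustment (an assumption, not the asker's code): it uses [\w-]+ for the hyphenated group, [\d, ]+ for the number list, and + instead of * for the name parts. The variable names and the degenerate third line are invented for the demonstration.

```perl
use strict;
use warnings;

# The pattern from the question, and a hypothetical relaxed variant.
my $strict  = qr/\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\@\w*\.domain.com>\s\d+:/;
my $relaxed = qr/\((\d+)\)\s\w+\s\w+\s\([\w-]+\/\w+\),\s<\w+\.\w+\@\w+\.domain\.com>\s([\d, ]+):/;

my $line1 = '(2222) First Last (ab-cd/ABC1), <first.last@site.domain.com> 1224: efadsfadsfdsf';
my $line2 = '(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf';
# Made-up degenerate line: every \w* in the strict pattern matches the empty string.
my $empty = '(1)   (/x), <.@.domain.com> 1: y';

for my $line ($line1, $line2, $empty) {
    printf "strict=%-3s relaxed=%-3s %s\n",
        ($line =~ $strict  ? 'yes' : 'no'),
        ($line =~ $relaxed ? 'yes' : 'no'),
        $line;
}
```

The strict pattern rejects the first line because of the dash and the second because of the comma-separated list, yet it accepts the degenerate line with empty names; the relaxed variant behaves the other way round.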

sub extract_nums {
    my $string = shift;
    if ($string =~ /<[^>]*> *([\d, ]+):/) {
        return $1 =~ /\d+/g;   # return the extracted numbers as a list
        # return $1;           # or just return the matched string as-is
    }
    return;   # no match: empty list (undef in scalar context)
}

This assumes, of course, that you cannot have <> tags in front of the email part of the line. It will capture any digits, commas, and spaces found between a <> tag and a colon, and then return a list of the numbers found in the match. You can also return just the string, as shown in the commented-out line.
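As a usage sketch, the sub could be applied to every line of the tool's output. The sub is repeated here so the example runs on its own, with a bare return for the no-match case so that a failed match yields an empty list. The @output array is a stand-in for the real command output, and the print statement is a placeholder for the actual table query; both are assumptions.

```perl
use strict;
use warnings;

sub extract_nums {
    my $string = shift;
    if ($string =~ /<[^>]*> *([\d, ]+):/) {
        return $1 =~ /\d+/g;   # list of the numbers in the captured text
    }
    return;                    # no match: empty list
}

# Stand-in for the tool's output; in a real script these lines would come
# from the command itself (for example via backticks or a piped open).
my @output = (
    '(2222) First Last (ab-cd/ABC1), <first.last@site.domain.com> 1224: efadsfadsfdsf',
    '(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf',
    'a line without an email part',
);

my %seen;
for my $line (@output) {
    for my $id (extract_nums($line)) {
        next if $seen{$id}++;                    # handle each ID only once
        print "would query table for id $id\n";  # placeholder for the real query
    }
}
```

Returning a list keeps the calling loop simple: a line that does not match just contributes no IDs.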

There would appear to be something missing from your examples. Is this what they're supposed to look like, with the email addresses?

(1234) First Last (ab-cd/ABC1), <foo.bar@domain.com> 1224: efadsfadsfdsf

(1234) First Last (abcd/ABC12), <foo.bar@domain.com> 1234, 4657: efadsfadsfdsf

If so, this should work:

\((\d+)\)\s\w*\s\w*\s\([\w-]*\/\w+\),\s<\w*\.\w*\@(?:\w+\.)*domain\.com>\s(\d+(?:,\s\d+)*):

$string =~ /.*>\s*(.+):.+/;
$numbers = $1;

That's it. Tested.

To capture only the numbers:

$string =~ /.*>\s*([0-9,\s]+):.+/;
$numbers = $1;

Not tested, but you get the idea.
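One reason to prefer a digit-only capture is that a greedy (.+) runs up to the last colon in the line. The sketch below uses a made-up entry whose headline contains a colon; the second pattern is a hypothetical variant that accepts only digits, commas, and whitespace before the colon.

```perl
use strict;
use warnings;

# Hypothetical entry whose headline itself contains a colon.
my $string = '(2222) First Last (ab-cd/ABC1), <first.last@site.domain.com> 1224: Re: hello';

# Greedy (.+) backtracks only as far as the LAST colon in the line.
my ($loose)  = $string =~ /.*>\s*(.+):.+/;

# A character class stops at the first character that is not part of the list.
my ($digits) = $string =~ /.*>\s*([0-9,\s]+):.+/;

print "loose:  $loose\n";    # prints "loose:  1224: Re"
print "digits: $digits\n";   # prints "digits: 1224"
```

If the headline can never contain a colon, both patterns give the same result for these entries.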
