How can I specify an optional capture group in this RegEx?

How can I fix this RegEx to optionally capture a file extension?

I am trying to match a string with an optional component, but something appears to be wrong. (The strings being matched are from a printer log.)


My RegEx (.NET flavor) is as follows:

.*(header_\d{10,11}_).*(_.*_\d{8}).*(\.\w{3,4}).*
-------------------------------------------
.*                   # Ignore some garbage in the front
(header_             # Match the start of the file name,
    \d{10,11}_)      #     including the ID (10 - 11 digits)
.*                   # Ignore the type code in the middle
(_.*_\d{8})          # Match some random characters, then an 8-digit date
.*                   # Ignore anything between this and the file extension
(\.\w{3,4})          # Match the file extension, 3 or 4 characters long
.*                   # Ignore the rest of the string


I expect this to match strings like:

str1 = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"
str2 = "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt"
str3 = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]"


Where the capture groups return something like:

$1  =  header_0000000602_
$2  =  _mc2e1nrobr1a3s55niyrrqvy_20081212
$3  =  .doc


Here $3 can be empty if no file extension is found. $3 is the optional part, as you can see in str3 above.

If I add "?" to the end of the third capture group, "(\.\w{3,4})?", the RegEx no longer captures $3 for any string. If I add "+" instead, "(\.\w{3,4})+", the RegEx no longer matches str3 at all, which is to be expected.

I feel that using "?" at the end of the third capture group is the appropriate thing to do, but it doesn't work as I expect. I am probably being too naive with the ".*" sections that I use to ignore parts of the string.


Doesn't Work As Expected:

.*(header_\d*_).*(_.*_.{8}).*(\.\w{3,4})?.*

One possibility is that the second-to-last .* is being greedy. You might try changing it to:

.*(header_\d*_).*(_.*_.{8}).*?(\.\w{3,4})?.*
                             ^ Added that

That wasn't correct. This one will match the input you supplied, but it assumes that the first . it encounters is the start of a file extension:

.*(header_\d*_).*(_.*_.{8})[^\.]*(\.\w{3,4})?.*

Edit: Remove the escaping I had in the second regex.
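To see why the lazy `.*?` did not help while `[^.]*` does, here is a small sketch. It uses Python's `re` module as a stand-in for the .NET engine (an assumption on my part; the constructs involved behave the same way in both for these inputs). A lazy quantifier followed by an optional group can both succeed by matching nothing, so the engine never needs to look for the period.

```python
import re

# Sample strings from the question.
samples = [
    "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]",
    "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt",
    "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]",
]

# Lazy ".*?" followed by an optional group: both are satisfied by matching nothing.
lazy = re.compile(r".*(header_\d*_).*(_.*_.{8}).*?(\.\w{3,4})?.*")

# "[^.]*" has to stop at the first period, so the optional group is tried right there.
negated = re.compile(r".*(header_\d*_).*(_.*_.{8})[^.]*(\.\w{3,4})?.*")

# Python reports an optional group that did not take part in the match as None.
lazy_ext = [lazy.match(s).group(3) for s in samples]
negated_ext = [negated.match(s).group(3) for s in samples]

print("lazy:   ", lazy_ext)
print("negated:", negated_ext)
```

With the lazy version the extension group stays empty for every sample; with the negated character class it is filled in whenever the string has an extension.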

I believe the problem is in your 3rd .* , which you annotated above with "Ignore anything between this and the file extension". It's greedy, so it will match ANYTHING. When you make the extension pattern optional, the 3rd .* matches all the way to the end of the string, which is allowed because the optional group can then match nothing. Assuming that there will NEVER be a '.' character in that extraneous bit, you can replace .* with [^.]* and the rest will hopefully work after you restore the ? that you had to remove.
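One way to check this diagnosis is to wrap the third `.*` in its own capture group and look at what it consumed. The sketch below does that with Python's `re` module standing in for .NET (an assumption; greedy matching works the same way here). The extra probe group is hypothetical and shifts the extension to group 4.

```python
import re

str1 = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"

# The question's pattern with "?" added, plus a probe group (3) around the
# third ".*" so we can inspect it. The extension becomes group 4.
probe = re.compile(r".*(header_\d{10,11}_).*(_.*_\d{8})(.*)(\.\w{3,4})?.*")

m = probe.match(str1)
swallowed = m.group(3)   # what the greedy ".*" consumed
extension = m.group(4)   # None when the optional group matched nothing

print(repr(swallowed))
print(repr(extension))
```

The greedy `.*` takes everything after the date, extension included, and the optional group is left with nothing to match.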

Well, .* is probably the wrong way to start the regex. It matches 0 or more ( * ) characters of anything ( . ), which means it first swallows your entire file name and only gives characters back by backtracking. If you leave it off, the regex will start matching when it reaches header, which is what you want. You could also replace it with \b , which matches a word boundary. I also suggest using a tool such as The Regex Coach so you can step through it and see exactly what's wrong and what your capture groups will be.
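A minimal sketch of that suggestion, again using Python's `re` as a stand-in for .NET (an assumption). Python's `search`, like .NET's `Regex.Match`, is unanchored, so it scans forward by itself and the leading `.*` is not needed. The pattern combines this idea with the `[^.]*` fix from the other answers.

```python
import re

str2 = "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt"

# No leading ".*": search() finds the first position where the pattern can match.
# "\b" asserts a word boundary, so "header" cannot be the tail of a longer word.
pattern = re.compile(r"\b(header_\d{10,11}_).*(_.*_\d{8})[^.]*(\.\w{3,4})?")

m = pattern.search(str2)
groups = m.groups()
start = m.start()   # index in str2 where the match begins

print(groups)
print(start)
```

The match starts at the `h` of `header` rather than at the start of the string, and the three groups are the same as before.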

After the second match, specify that you only want to match characters that are not a period, and then match your extension.

".*(header_\d{10,11}_).*(_.*_\d{8})[^.]*(\.\w{3,4})?"

This is your correct result:

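.*?(header_\d*_).*?(_.*_.{8})[^.]*(\.\w{3,4})?.*
-------------------------------------------
.*?                  # Lazy match: skip as little as possible before the header
(header_             # Match the start of the file name,
    \d*_)            #     including the ID digits
.*?                  # Lazy match: skip the type code
(_.*_.{8})           # Match the random characters, then the 8-character date
[^.]*                # Take everything that is NOT a period
(\.\w{3,4})?         # Optionally match the extension
.*                   # Ignore the rest of the string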

The implicit assumption is that the first period after the digits match will be the beginning of the file extension. The following wouldn't meet this requirement:

string unmatched = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].foobar.txt"
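To make the limitation concrete, here is a quick check with Python's `re` standing in for .NET (an assumption). Note that this string does still match the pattern; what goes wrong is the content of the extension group.

```python
import re

pattern = re.compile(r".*?(header_\d*_).*?(_.*_.{8})[^.]*(\.\w{3,4})?.*")

tricky = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].foobar.txt"

m = pattern.match(tricky)
ext = m.group(3)   # taken from the FIRST period, capped at 4 word characters

print(ext)
```

The group captures a truncated piece of `.foobar` instead of `.txt`, because `[^.]*` stops at the first period and `\w{3,4}` takes at most four characters.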

Also, when taking out your groups in .NET, make sure your code looks like this:

regex.Match(string_to_match).Groups[1].Value
regex.Match(string_to_match).Groups[2].Value
regex.Match(string_to_match).Groups[3].Value

and not this:

// Groups[0] is the entire match, not the first capture group
regex.Match(string_to_match).Groups[0].Value
regex.Match(string_to_match).Groups[1].Value
regex.Match(string_to_match).Groups[2].Value

This is something that tripped me up at first.
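The same numbering convention shows up in most regex APIs. As an illustration only, here is the equivalent in Python's `re` (not .NET), matching once and reading every group from the same match object. One difference worth knowing: Python returns None for an optional group that did not participate, whereas in .NET the group's Value is an empty string and its Success property is false.

```python
import re

pattern = re.compile(r".*(header_\d{10,11}_).*(_.*_\d{8})[^.]*(\.\w{3,4})?.*")
str3 = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]"

m = pattern.match(str3)   # match once, then read all groups from this object

whole = m.group(0)        # index 0: the entire matched text
header = m.group(1)       # index 1: the first capture group
date_part = m.group(2)    # index 2: the second capture group
ext = m.group(3) or ""    # unmatched optional group is None; normalise to ""

print(whole)
print(header, date_part, repr(ext))
```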

This works for the examples you've posted:

^.*?(?<header>\d+)_.*?_(?<date>\d{8}).*?(?:\.(?<ext>\w{3,4}))?[\w\s\[\]]*$

I'm assuming that the text "header" and the random characters between that and the date aren't important, so those aren't captured by this regex. I also used the .NET named capture feature for clarity, but be aware that the syntax differs in some other flavors of RegEx and a few don't support named groups at all.

If the text after the file name contains any non-alphanumeric characters other than [ and ], the pattern will need to be revised.
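As an example of the flavor differences, here is the same pattern ported to Python, which spells named groups `(?P<name>...)` instead of `(?<name>...)`. This is a sketch for comparison, not the .NET code itself. The trailing `$` is what makes the optional group work here: the match cannot finish until the extension has been accounted for.

```python
import re

samples = [
    "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]",
    "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt",
    "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]",
]

# Same pattern as above, with Python's named-group syntax.
named = re.compile(
    r"^.*?(?P<header>\d+)_.*?_(?P<date>\d{8}).*?(?:\.(?P<ext>\w{3,4}))?[\w\s\[\]]*$"
)

# groupdict() maps each group name to its captured text (None if it did not match).
results = [named.match(s).groupdict() for s in samples]

for r in results:
    print(r)
```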

Here is one that works for what you're posting:

^.*(?<header>header_\d{10,11})_.*(?<date>_[a-z0-9]+_\d{8})(\[\d+\])(?<ext>(\.[a-zA-Z0-9]{3,4})?).*

The replacement is:

Header: $1
Date: $2
Extension: $4

I didn't use the named groups in the replacement because I couldn't figure out how to get TextMate to do it, but the named groups were helpful to force the capture.
