Regex is matching one or more groups too many

I have a series of filenames of varying complexity. They always follow the template [_]{ASSET}_[OPTIONAL_DESCRIPTION]_v{#####}.{EXT}, where the []s mark optional parts. Within that format, though, each piece can be arbitrarily complex, and the number of leading _s is arbitrary.

character_thing_v001.md
character_Description_v001.md
character_Some_Long_Description_v001.md
character_thing_with_additional_info_v001.md
character_thing_with_additional_info_Description_v001.md
character_thing_with_additional_info_More_Description_Info_v001.md
character_with_additional_info_Complete234ly_arbitrary_Description_v001.md
_character_thing_v001.md
___character_Description_v001.md
____character_Some_Long_Description_v001.md
__character_thing_with_additional_info_v001.md
__character_thing_with_additional_info_Description_v001.md
___character_thing_with_additional_info_More_Description_Info_v001.md

I wrote a lookahead assertion to separate ASSET from DESCRIPTION, and everything worked fine until recently, when my boss threw a wrench in the system. Now I have to support assets whose convention could be "some_undercase" OR "CAPS_###". I modified the pattern to allow A-Z and made descriptionText match anything. That's where the mess started.

     (?:[_]+)?
     (?P<assetText>[a-zA-Z0-9]+
       (?=_[a-zA-Z0-9]+)?  # lookahead and optionally assert _Capital
         (?:(?:_[a-zA-Z0-9]+)+)?  # match next group if it exists
     )  # get full match
     (?:[_]+)?
     \_(?P<descriptionText>.+)?
     \_v(?P<versionIncrement>\d+)
     \.(?:\.)?
       (?P<extension>(?:md|some|other|extension|options)) 

This gets me part of the way there, but it still has problems, which you can view here.

Now that the ASSET can have capitals, the lookahead matches too much for ASSET and starts eating into the DESCRIPTION. This pattern is one of several that get generated automatically, so I'm looking for a way to solve the root of the problem rather than write around it. Any guidance would be really appreciated, thank you.
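To make the over-match concrete, here is a minimal sketch that compiles the pattern from the question as written. It assumes Python's re module with re.VERBOSE (the post does not say how the pattern is compiled); the names QUESTION_PATTERN and split_with_question_pattern are made up for illustration. Note that a lookahead followed by ? succeeds whether or not it matches, so it constrains nothing, and the greedy run of _tokens after it is free to take as much as it can.

```python
import re

# The pattern from the question, compiled in verbose mode (an assumption).
QUESTION_PATTERN = re.compile(r"""
    (?:[_]+)?
    (?P<assetText>[a-zA-Z0-9]+
      (?=_[a-zA-Z0-9]+)?  # lookahead and optionally assert _Capital
        (?:(?:_[a-zA-Z0-9]+)+)?  # match next group if it exists
    )  # get full match
    (?:[_]+)?
    \_(?P<descriptionText>.+)?
    \_v(?P<versionIncrement>\d+)
    \.(?:\.)?
      (?P<extension>(?:md|some|other|extension|options))
""", re.VERBOSE)


def split_with_question_pattern(filename):
    """Return (assetText, descriptionText), or None if nothing matches."""
    m = QUESTION_PATTERN.fullmatch(filename)
    return (m.group("assetText"), m.group("descriptionText")) if m else None


# The asset swallows every token except the last one before _v###:
print(split_with_question_pattern("character_Some_Long_Description_v001.md"))
# ('character_Some_Long', 'Description')
print(split_with_question_pattern("character_thing_v001.md"))
# ('character', 'thing')
```

The second result also shows that the mandatory underscore in front of descriptionText forces the last token into the description even when the filename has none.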

I can't really follow the logic of some parts of your regex; they seem unnecessary.

Doesn't this simplified regex do the same job?

_*
(?P<assetText>[a-zA-Z0-9]+(_[a-z_0-9]+)?)
(_  (?P<descriptionText>[a-zA-Z0-9_]+)  )?
_v(?P<versionIncrement>[0-9]+)
(?P<extension>\.[A-Za-z0-9]+)
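As a quick check, the simplified pattern above can be tried against a few of the filenames from the question. This is only a sketch: it assumes Python's re module, and it needs re.VERBOSE because the pattern contains literal spaces and line breaks. The names ANSWER_PATTERN and parse are made up for illustration.

```python
import re

# The simplified pattern, compiled in verbose mode so the layout whitespace is ignored.
ANSWER_PATTERN = re.compile(r"""
    _*
    (?P<assetText>[a-zA-Z0-9]+(_[a-z_0-9]+)?)
    (_  (?P<descriptionText>[a-zA-Z0-9_]+)  )?
    _v(?P<versionIncrement>[0-9]+)
    (?P<extension>\.[A-Za-z0-9]+)
""", re.VERBOSE)


def parse(filename):
    """Return the named groups as a dict, or None if the name does not match."""
    m = ANSWER_PATTERN.fullmatch(filename)
    return m.groupdict() if m else None


for name in (
    "character_thing_v001.md",
    "character_Some_Long_Description_v001.md",
    "___character_thing_with_additional_info_More_Description_Info_v001.md",
):
    print(name, "->", parse(name))
```

With this pattern the asset can only continue past its first token through lower-case letters, digits and underscores, so the first token that starts with an upper-case letter begins the description. An all-lower-case description would therefore be absorbed into the asset.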

Perhaps the (natural-language) rules for what constitutes an asset and what constitutes an optional description need to be clarified:

  • Can an "asset" contain an underscore? (I'm assuming not, going by the template in your first sentence.)
    • If yes: what's the rule for where "asset" ends and "description" begins? Is it that the description always starts with an upper-case letter?
      • If yes: what are the rules for where upper-case letters can and cannot appear within the "asset"? If there are no restrictions, then the split between asset and description is truly ill-defined.
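Once such a rule is written down, the split can be done token by token instead of inside one large regex. The sketch below assumes one possible rule, which is not confirmed anywhere in the question: the description starts at the first underscore-separated token that begins with an upper-case letter and is not entirely upper-case, so all-caps and numeric tokens such as "CAPS" and "001" stay with the asset. The names FRAME, is_description_start and split_name, and the CAPS_001 filenames, are hypothetical.

```python
import re

# Hypothetical outer frame: leading underscores, body, _v<digits>, extension.
FRAME = re.compile(r"_*(?P<body>.+)_v(?P<version>\d+)\.(?P<ext>[A-Za-z0-9]+)")


def is_description_start(token):
    # Assumed rule: a token that starts with an upper-case letter but is not
    # all upper-case begins the description; "CAPS" and "001" do not.
    return token[:1].isupper() and not token.isupper()


def split_name(filename):
    """Return (asset, description or None), or None if the frame does not match."""
    m = FRAME.fullmatch(filename)
    if m is None:
        return None
    tokens = m.group("body").split("_")
    # Token 0 always belongs to the asset, so the search starts at index 1.
    for i in range(1, len(tokens)):
        if is_description_start(tokens[i]):
            return "_".join(tokens[:i]), "_".join(tokens[i:])
    return "_".join(tokens), None


print(split_name("character_Some_Long_Description_v001.md"))
# ('character', 'Some_Long_Description')
print(split_name("CAPS_001_Description_v002.md"))
# ('CAPS_001', 'Description')
```

Under this assumed rule a single-letter upper-case token such as "A" counts as all-caps and stays with the asset, which is exactly the kind of edge case the questions above are meant to settle.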
