
Ruby Regex matching unexpected characters

I am trying to write a script that parses the filename of a comic book and extracts info such as the series name, publication year, etc. In this case, I am trying to extract the publication year from the name. Consider the following names: I need to match and get the value 2003. Below is the expression I have for this.

r = %r{ (?i)(^|[,\s-_])v(\d{4})($|[,\s-_]) }

However, this matches the number irrespective of what character I have before the v or after the number.

I expect the first two to not match and the third to match.

  • 010 - All Star Batman & Robin The Boy Wonder 01 - av2003
  • 010 - All Star Batman & Robin The Boy Wonder 01 - v2003t
  • 010 - All Star Batman & Robin The Boy Wonder 01 - v2003

What am I doing wrong in this case?

Inside a character class (i.e. [...]), the - character has a special meaning when it sits between two other characters: it creates a range that starts at the character before it and ends at the character after it.

Here, you want it literally, so you should either escape the - or (more idiomatically in regex) put it as the first or last character in the [].
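To make the difference concrete, here is a minimal sketch of the two spellings. The patterns are illustrative only and are not taken from the question; the variable names are hypothetical.

```ruby
# Illustrative patterns only; none of these appear in the original post.
range_class  = /[,-_]/   # hyphen between two characters: range from "," (0x2C) to "_" (0x5F)
literal_last = /[,_-]/   # hyphen placed last: a literal "-"
literal_esc  = /[,\-_]/  # hyphen escaped: a literal "-"

# The range silently includes digits and uppercase letters.
p range_class.match?("A")   # true
p range_class.match?("5")   # true

# The literal forms accept only ",", "_" and "-".
p literal_last.match?("A")  # false
p literal_last.match?("-")  # true
p literal_esc.match?("A")   # false
p literal_esc.match?("-")   # true
```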

Also, your pattern contains literal space characters but no /x modifier, so those spaces have to match literally. You probably don't want to capture what comes before and after the year either, so the final pattern would be:

%r{(?i)(?:^|[,\s_-])v(\d{4})(?:$|[,\s_-])}
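As a quick check, the pattern above can be run against the three names from the question. This is a sketch: the constant YEAR_PATTERN and the helper publication_year are hypothetical names introduced here, not part of the original answer.

```ruby
# Pattern copied verbatim from the answer above.
YEAR_PATTERN = %r{(?i)(?:^|[,\s_-])v(\d{4})(?:$|[,\s_-])}

# Hypothetical helper: returns the four-digit year as a String,
# or nil when the name has no delimited "vYYYY" token.
def publication_year(filename)
  filename[YEAR_PATTERN, 1]  # 1 selects the first capture group
end

names = [
  "010 - All Star Batman & Robin The Boy Wonder 01 - av2003",
  "010 - All Star Batman & Robin The Boy Wonder 01 - v2003t",
  "010 - All Star Batman & Robin The Boy Wonder 01 - v2003",
]

names.each do |name|
  puts "#{name.inspect} => #{publication_year(name).inspect}"
end
```

Only the third name yields "2003"; the first two yield nil because the v is preceded by a letter or the digits are followed by one.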

@smathy answered your question (rather nicely). I want to point out that you could write your regex without a capture group:

 r = /
     (?:         # begin a non-capture group
       ^|[,\s_-] # match the start of a line, a whitespace char or a char in ',_-'
     )           # end the non-capture group
     v           # match v
     \K          # forget everything matched so far
     \d{4}       # match 4 digits
     (?=         # begin a positive lookahead
       $|[,\s_-] # match the end of a line, a whitespace char or a char in ',_-'
     )           # end the positive lookahead
     /x

"010 - All Star Batman & Robin The Boy Wonder 01 - av2003"[r]
  #=> nil 
"010 - All Star Batman & Robin The Boy Wonder 01 - v2003t"[r]
  #=> nil 
"010 - All Star Batman & Robin The Boy Wonder 01 - v2003"[r]
  #=> "2003"

  • If you wish to match v or V, change the line v to [vV].
  • If you wish the regex to be case-insensitive, change /x to /ix (in which case there is no need to replace v with [vV]).
  • If you wish to restrict the year to a plausible range, change \d{4} to [12]\d{3}, which accepts 1000 through 2999; (?:19|20)\d{2} is tighter and accepts only 1900 through 2099.
  • You could alternatively change the non-capture group to a positive lookbehind, (?<=^|[,\s_-]), and delete \K.
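
The lookbehind and case-insensitive variants from the bullets above might be combined as follows. This is a sketch rather than the answerer's own code, and the variable name r2 is hypothetical.

```ruby
# Variant sketch: lookbehind instead of \K, plus the i flag.
r2 = /
     (?<=^|[,\s_-])  # preceded by a line start or a delimiter
     v               # match v (V also matches because of the i flag)
     \d{4}           # match 4 digits; the whole match is the year
     (?=$|[,\s_-])   # followed by a line end or a delimiter
     /ix

p "010 - All Star Batman & Robin The Boy Wonder 01 - av2003"[r2]
p "010 - All Star Batman & Robin The Boy Wonder 01 - v2003t"[r2]
p "010 - All Star Batman & Robin The Boy Wonder 01 - V2003"[r2]
```

Because both the lookbehind and the lookahead are zero-width, the match consists of the v plus the digits unless the v is also moved into the lookbehind; here the v is kept in the match on purpose so that the difference from \K is visible, and [1..] strips it if only the digits are wanted.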
