How can I improve this small Ruby Regex snippet?
The purpose of this code is to be used in a method that captures a string of hash tags (#twittertype) from a form, parses through the list of words, and makes sure all the words are separated out.
WORD_TEST = "123 sunset #2d2-apple,#home,#star #Babyclub, #apple_surprise #apple,cats mustard#dog , #basic_cable safety #222 #dog-D#DOG#2D "
SECOND_TEST = 'orion#Orion#oRion,Mike'
This is my problem area: regexps...
_string_rgx = /([a-zA-Z0-9]+(-|_)?\w+|#?[a-zA-Z0-9]+(-|_)?\w+)/
add_pound_sign = lambda { |a| a[0].chr == '#' ? a : a='#' + a; a}
I don't know that much about regular expressions, hence the need to collect the first element from each result of the scan. It yielded weird stuff, but the first element was always what I wanted.
t_word = WORD_TEST.scan(_string_rgx).collect {|i| i[0] }
s_word = SECOND_TEST.scan(_string_rgx).collect {|i| i[0] }
t_word.map! { |a| a = add_pound_sign.call(a); a }
s_word.map! { |a| a = add_pound_sign.call(a); a }
The results are what I want. I just want insight from the Ruby regex gurus out there.
puts t_word.inspect
[
"#123", "#sunset", "#2d2-apple", "#home", "#star", "#Babyclub",
"#apple_surprise", "#apple", "#cats", "#mustard", "#dog",
"#basic_cable", "#safety", "#222", "#dog-D", "#DOG", "#2D"
]
puts s_word.inspect
[
"#orion", "#Orion", "#oRion", "#Mike"
]
Thanks in advance.
Let's unfold the regex:
(
[a-zA-Z0-9]+ (-|_)? \w+
| #? [a-zA-Z0-9]+ (-|_)? \w+
)
(               begin capture group
[a-zA-Z0-9]+    match one or more alphanumeric characters
(-|_)?          match a hyphen or an underscore and capture it; this group may fail
\w+             match one or more "word" characters (alphanumeric + underscore)
|               OR match this:
#?              match an optional # character
[a-zA-Z0-9]+    match one or more alphanumeric characters
(-|_)?          match a hyphen or an underscore and capture it; may fail
\w+             match one or more word characters
)               end capture
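The "weird stuff" the asker saw follows from the capture groups in this breakdown: when a pattern contains groups, String#scan returns one array per match with the text of each group instead of the whole match. A minimal sketch of that behaviour, where the method names scan_with_groups and scan_without_groups are hypothetical and only String#scan is standard Ruby:

```ruby
# The asker's original pattern: one outer group plus two optional (-|_) groups.
def scan_with_groups(text)
  text.scan(/([a-zA-Z0-9]+(-|_)?\w+|#?[a-zA-Z0-9]+(-|_)?\w+)/)
end

# Assumed rewrite: the same matching logic with a non-capturing group (?:...),
# so scan returns the matched strings directly.
def scan_without_groups(text)
  text.scan(/#?[a-zA-Z0-9]+(?:-|_)?\w+/)
end

p scan_with_groups("#2d2-apple sunset")
# => [["#2d2-apple", nil, "-"], ["sunset", nil, nil]]
p scan_without_groups("#2d2-apple sunset")
# => ["#2d2-apple", "sunset"]
```

With non-capturing groups the extra collect { |i| i[0] } step is no longer needed.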
I'd rather write this regex like this:
(#? [a-zA-Z0-9]+ (-|_)? \w+)
or
( #? [a-zA-Z0-9]+ (-?\w+)? )
or
( #? [a-zA-Z0-9]+ -? \w+ )
(all are reasonably equivalent)
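One way to check how close these three forms are is to run them side by side on the asker's test string. The sketch below assumes the spaces in the patterns above are only for readability and writes the groups as non-capturing so that scan returns plain strings; candidate_patterns and scan_all are hypothetical names.

```ruby
# The three suggested rewrites, without the readability spaces.
def candidate_patterns
  {
    first:  /#?[a-zA-Z0-9]+(?:-|_)?\w+/,
    second: /#?[a-zA-Z0-9]+(?:-?\w+)?/,
    third:  /#?[a-zA-Z0-9]+-?\w+/
  }
end

# Scan the text with every candidate and collect the results by name.
def scan_all(text)
  candidate_patterns.map { |name, pattern| [name, text.scan(pattern)] }.to_h
end

word_test = "123 sunset #2d2-apple,#home,#star #Babyclub, #apple_surprise #apple,cats mustard#dog , #basic_cable safety #222 #dog-D#DOG#2D "
results = scan_all(word_test)
puts results.values.uniq.size == 1 # true when all three agree on this input

# The second form makes the tail optional, so it is the only one of the
# three that accepts a one-character tag.
p scan_all("#a")
```

On this input the three agree; they differ only for one-character tags, where the first and third forms return an empty array and the second returns ["#a"].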
You should note that this regex will fail on hashtags with Unicode characters, e.g. #Ü-Umlaut, #façade, etc. You are also limited to a two-character minimum length (#a fails, #ab matches), and a tag may have only one hyphen (#a-b-c fails / would return #a-b).
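These limits can be reproduced directly. The sketch below uses a non-capturing version of the pattern; ascii_tag_scan and unicode_tag_scan are hypothetical names, and the Unicode-aware pattern is only one possible variant (an assumption, not something proposed in the thread) built on Ruby's \p{Alnum} property.

```ruby
# ASCII-only pattern, equivalent to the simplified forms above.
def ascii_tag_scan(text)
  text.scan(/#?[a-zA-Z0-9]+(?:-|_)?\w+/)
end

# Assumed alternative: Unicode letters and digits, one-character tags,
# and any number of hyphen-separated parts.
def unicode_tag_scan(text)
  text.scan(/#?[\p{Alnum}_]+(?:-[\p{Alnum}_]+)*/)
end

p ascii_tag_scan("#a")      # => []
p ascii_tag_scan("#ab")     # => ["#ab"]
p ascii_tag_scan("#a-b-c")  # => ["#a-b"]
p ascii_tag_scan("#façade") # => ["#fa", "ade"]

# With the Unicode-aware variant each of these tags is kept whole.
p unicode_tag_scan("#façade #Ü-Umlaut #a #a-b-c")
```

Whether the looser variant is appropriate depends on which characters the form is meant to accept as part of a tag.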
I would reduce your regex pattern like this:
WORD_TEST = "123 sunset #2d2-apple,#home,#star #Babyclub, #apple_surprise #apple,cats mustard#dog , #basic_cable safety #222 #dog-D#DOG#2D "
foo = []
WORD_TEST.scan(/#?[-\w]+\b/) do |s|
  # Prepend '#' only when the match does not already start with one.
  foo.push(s.start_with?('#') ? s : '#' + s)
end
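The same idea can be wrapped in a small method so both of the asker's test strings go through one code path. This is a sketch only; normalize_hash_tags is a hypothetical name, and it assumes the pattern /#?[-\w]+\b/ from this answer.

```ruby
# Extract every tag-like word and make sure each one starts with '#'.
def normalize_hash_tags(text)
  text.scan(/#?[-\w]+\b/).map do |tag|
    tag.start_with?('#') ? tag : '#' + tag
  end
end

p normalize_hash_tags('orion#Orion#oRion,Mike')
# => ["#orion", "#Orion", "#oRion", "#Mike"]
```

Because this pattern has no capture groups, scan yields plain strings, and run on WORD_TEST the method returns the same 17 tags listed in the question.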