Why doesn't this regex work with Ruby?

Trying to match the hash character fails, but matching succeeds for every other member of the list.

Why does this fail?

Thanks,

Joe

UNIT = [ 'floor', 'fl', '#', 'penthouse', 'mezzanine', 'basement', 'room' ]

unit_regex = "\\b(" + UNIT.to_a.join("|") + ")\\b"

unit_regexp = Regexp.new(unit_regex, Regexp::IGNORECASE)

x = unit_regexp.match('#')   # => nil, even though '#' is in UNIT

As noted in the comments, your problem is that \b is a word boundary inside a regex. (The exception is inside a character class: sigh, the \b in /[\b]/ is a backspace, just like in a double-quoted string.) A word boundary is roughly

a word character on one side and nothing or a non-word character on the other side

But # is not a word character, so when '#' is the whole string there is no word boundary on either side of it. /\b/ can't match anywhere in '#', and your whole regex fails to match.
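A small sketch of that behaviour, rebuilding the regex from the question. The name OLD_UNIT_REGEXP is only illustrative and is not part of the original code.

```ruby
UNIT = ['floor', 'fl', '#', 'penthouse', 'mezzanine', 'basement', 'room']

# Same construction as in the question: \b on both sides of the alternation.
OLD_UNIT_REGEXP = Regexp.new("\\b(" + UNIT.join("|") + ")\\b", Regexp::IGNORECASE)

# A lone '#' has no word character next to it, so \b never matches.
p OLD_UNIT_REGEXP.match('#')          # => nil

# Ordinary words have a boundary at each end, so they match as expected.
p OLD_UNIT_REGEXP.match('Floor')[1]   # => "Floor"

# Between word characters, '#' does have a boundary on each side.
p OLD_UNIT_REGEXP.match('a#b')[1]     # => "#"
```

The last case shows that \b is not broken for #; it simply asks for a word character on exactly one side of each boundary.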

You're going to have to be more explicit about what you're trying to match. A first stab would be "the beginning of the string or whitespace" instead of the first \b and "the end of the string or whitespace" instead of the second \b. That could be expressed like this:

unit_regex = '(?<=\A|\s)(' + UNIT.to_a.join('|') + ')(?=\z|\s)'

Note that I've switched to single quotes to avoid all the double-escaping hassles. The ?<= is a positive lookbehind: it means that \A or \s needs to be there, but it won't be included in the matched text. Similarly, ?= is a positive lookahead. See the manual for more details. Also note that we're using \A rather than ^, since ^ matches the beginning of a line, not of the string; similarly, we use \z instead of $ because \z matches the end of the string whereas $ matches the end of a line.
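To illustrate the difference between the line anchors and the string anchors, here is a minimal sketch. The sample string and the constant name TEXT are made up for the demonstration.

```ruby
# A two-line string: 'fl' starts the second line, '5' ends the first line.
TEXT = "suite 5\nfl 2"

p(/^fl/.match?(TEXT))    # => true,  ^ matches right after the newline
p(/\Afl/.match?(TEXT))   # => false, \A matches only at the very start

p(/5$/.match?(TEXT))     # => true,  $ matches right before the newline
p(/5\z/.match?(TEXT))    # => false, \z matches only at the very end
```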

You may need to tweak the regex depending on your data but hopefully that will get you started.
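As a rough check of the lookaround version, the sketch below compiles it and tries a few inputs. The sample strings and the name UNIT_REGEXP are assumptions for illustration only.

```ruby
UNIT = ['floor', 'fl', '#', 'penthouse', 'mezzanine', 'basement', 'room']

# Lookbehind: start of string or whitespace. Lookahead: end of string or whitespace.
pattern = '(?<=\A|\s)(' + UNIT.join('|') + ')(?=\z|\s)'
UNIT_REGEXP = Regexp.new(pattern, Regexp::IGNORECASE)

p UNIT_REGEXP.match('#')[1]           # => "#"
p UNIT_REGEXP.match('apt # 4')[1]     # => "#"
p UNIT_REGEXP.match('3rd Floor')[1]   # => "Floor"

# Still rejected: the unit must be delimited by whitespace or the string ends.
p UNIT_REGEXP.match('flooring')       # => nil
p UNIT_REGEXP.match('#4')             # => nil
```

The last line is the kind of case where tweaking may be needed: if inputs like '#4' should count, the lookahead would have to allow a digit as well.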
