
Why does sed match something outside the group as part of the group?

I was trying to use sed recently to generate a bunch of methods from comma-and-newline separated enumeration members. I ran into the following behavior which seems unintuitive:

$ echo 'Hello,' | sed 's/\(.*\),\?/"Hi \1!"/g'
"Hi Hello,!"

Here I'm trying to capture everything before the comma into a group via \(.*\), then I allow an optional comma with ,\?. I expected this to replace \1 with everything before the first comma, namely Hello, but for some reason the comma is getting included in the substitution too, although it is not inside the group. Why is this the case?

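Regular expressions do greedy matching (from left to right) by default, backtracking only if the greediest match doesn't work. So in the case of \(.*\),\?, the greediest match is for \(.*\) to match Hello, (comma included) and for ,\? to match nothing. Since ,\? is allowed to match the empty string, the overall match succeeds and there is no reason to backtrack.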
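One way to see this for yourself is to put the optional comma in a second group and print both groups. This is just a quick sketch; it assumes GNU sed, since \? is a GNU extension to basic regular expressions, and the variable name is only for illustration.

```shell
# Wrap each part of the pattern in its own group to see what it matched.
# Assumes GNU sed: \? is a GNU extension to basic regular expressions.
result=$(echo 'Hello,' | sed 's/\(.*\)\(,\?\)/[\1][\2]/')
echo "$result"    # [Hello,][]
```

The first group has swallowed the comma and the second group is empty.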

As far as I know, basic regular expressions (which is what sed uses) have no non-greedy quantifiers. In Perl-style regular expressions (not used by sed), you put a question mark after the quantifier, so you'd use something like (.*?),? .

The next best thing you can do is to use something like \([^,]*\),\?, but then it'd stop matching at the first comma it sees.
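A small sketch of the negated bracket expression in action, again assuming GNU sed for \? (variable names are only for illustration):

```shell
# [^,]* cannot cross a comma, so the group stops right before the first one.
first=$(echo 'Hello,' | sed 's/\([^,]*\),\?/"Hi \1!"/')
echo "$first"    # "Hi Hello!"

# With several commas and no g flag, only the first field is rewritten.
multi=$(echo 'a,b,c' | sed 's/\([^,]*\),\?/"Hi \1!"/')
echo "$multi"    # "Hi a!"b,c
```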

That's because sed regexes are greedy, and the \? quantifier means to match 0 or 1 of the preceding token, which is the comma in this case.

So here the engine greedily matches to the end of the line, and because the comma is made optional by \?, it ends up inside the captured group \(.*\) instead.

To get the desired behavior, drop the \? :

%  echo 'Hello,' | sed 's/\(.*\),\?/"Hi \1!"/g'
"Hi Hello,!"

%  echo 'Hello,' | sed 's/\(.*\),/"Hi \1!"/g' 
"Hi Hello!"
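One caveat with a mandatory comma: a line that has no comma (such as the last member of an enumeration) no longer matches and passes through unchanged. A possible workaround, sketched below under the assumption that each member sits on its own line with at most one trailing comma, is to strip the comma in one step and wrap the line in a second step. The sample input and variable names are made up for illustration.

```shell
# With a mandatory comma, the comma-less line is left untouched.
strict=$(printf 'Hello,\nWorld\n' | sed 's/\(.*\),/"Hi \1!"/')
echo "$strict"
# "Hi Hello!"
# World

# Alternative: delete a trailing comma if there is one, then wrap the
# whole line (& in the replacement stands for the entire match).
two_step=$(printf 'Hello,\nWorld\n' | sed -e 's/,$//' -e 's/.*/"Hi &!"/')
echo "$two_step"
# "Hi Hello!"
# "Hi World!"
```

The two-step version uses only plain basic regular expression syntax, so it does not depend on the \? extension.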
