
Problem with using grep to match the whole word

I am trying to match a whole string in a list of newline-separated strings. Here is my example:

[hemanth.a@gateway ~]$ echo $snapshottableDirs
/user/hemanth.a/dummy1 /user/hemanth.a/dummy3
[hemanth.a@gateway ~]$ echo $snapshottableDirs | tr -s ' ' '\n'
/user/hemanth.a/dummy1
/user/hemanth.a/dummy3
[hemanth.a@gateway ~]$ echo $snapshottableDirs | tr -s ' ' '\n' | grep -w '/user/hemanth.a'
/user/hemanth.a/dummy1
/user/hemanth.a/dummy3

My aim is to find a match if and only if the string /user/hemanth.a exists as a whole word (on a line of its own) in the list of strings. But the above command also returns strings that merely contain /user/hemanth.a .

This is a sample scenario. There is no guarantee that all the strings that I would want to match will be in the form of /user/xxxxxx.x . Ideally I would want to match the exact string if it exists in a new line as a whole word in the list.

Any help would be appreciated. Thank you.

Update : Using fgrep -x '/user/hemanth.a' is probably a better solution here, as it avoids having to escape characters such as $ to prevent grep from interpreting them as meta-characters. fgrep performs a literal string match as opposed to a regular expression match, and the -x option tells it to only match whole lines.

Example:

> cat testfile.txt
foo
foobar
barfoo
barfoobaz

> fgrep foo testfile.txt
foo
foobar
barfoo
barfoobaz

> fgrep -x foo testfile.txt
foo
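Applied to the list from the question, the same idea might look like the sketch below. The helper name has_exact_line is made up for illustration, and grep -F is used as the equivalent spelling of fgrep; the sample data is copied from the question.

```shell
#!/bin/sh
# Sample data copied from the question.
snapshottableDirs='/user/hemanth.a/dummy1 /user/hemanth.a/dummy3'

# has_exact_line is a hypothetical helper name, not part of any tool.
# $1: space-separated list of strings
# $2: string that must equal one whole entry of that list
has_exact_line() {
    # -F: literal match, -x: whole line only, -q: no output, just exit status
    printf '%s\n' "$1" | tr -s ' ' '\n' | grep -Fxq -- "$2"
}

if has_exact_line "$snapshottableDirs" '/user/hemanth.a'; then
    parent_result='found'
else
    parent_result='not found'
fi

if has_exact_line "$snapshottableDirs" '/user/hemanth.a/dummy1'; then
    child_result='found'
else
    child_result='not found'
fi

echo "parent: $parent_result"   # parent: not found
echo "dummy1: $child_result"    # dummy1: found
```

The -- guards against a search string that happens to start with a dash.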

Original answer :

Try adding the $ regex metacharacter to the end of your grep expression, as in:

echo $snapshottableDirs | tr -s ' ' '\n' | grep -w '/user/hemanth.a$'

The $ metacharacter matches the end of the line.

While you're at it, you might also want to use the ^ metacharacter, which matches the beginning of the line, so that grep '/user/hemanth.a$' doesn't accidentally also match something like /user/foo/user/hemanth.a .

So you'd have this:

echo $snapshottableDirs | tr -s ' ' '\n' | grep '^/user/hemanth\.a$'

Edit : You probably don't actually want the -w here, so I've removed that from my answer.
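As a small sketch of why -w did not help in the question: -w only requires that the match is not adjacent to a word character (a letter, digit, or underscore), and the / that follows /user/hemanth.a is not a word character. The second sample line is invented for contrast.

```shell
#!/bin/sh
# First line is from the question; the second is a made-up contrast case.
w_matches=$(printf '%s\n' '/user/hemanth.a/dummy1' '/user/hemanth.ab' |
    grep -w '/user/hemanth.a')

# The first line matches because the match is followed by '/', a non-word
# character. The second does not, because 'b' is a word character.
echo "$w_matches"   # /user/hemanth.a/dummy1
```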

Edit 2 : @U. Windl brings up a good point. The . character in a regular expression is a metacharacter that matches any character, so grep /user/hemanth.a might end up matching things you're not expecting, such as /user/hemanthxa , etc. Or perhaps more likely, it would also match the line /user/hemanth/a . To fix that, you need to escape the . character. I've updated the grep line above to reflect this.
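To make the difference concrete, here is a minimal sketch that counts matching lines with and without the escaped dot. The three input lines are the ones mentioned in this edit.

```shell
#!/bin/sh
lines=$(printf '%s\n' '/user/hemanth.a' '/user/hemanth/a' '/user/hemanthxa')

# Unescaped dot: matches any single character, so all three lines match.
unescaped=$(printf '%s\n' "$lines" | grep -c '^/user/hemanth.a$')

# Escaped dot: matches only a literal '.', so only the first line matches.
escaped=$(printf '%s\n' "$lines" | grep -c '^/user/hemanth\.a$')

echo "unescaped: $unescaped"   # unescaped: 3
echo "escaped: $escaped"       # escaped: 1
```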

Update : In response to your question in the comments about how to escape a string so that it can be used in a grep regular expression...

Yes, you can escape a string so that it can be used in a regular expression. I'll explain how to do so, but first I should say that attempting to escape strings for use in a regex can become very complicated, with lots of weird edge cases. For example, an escaped string that works with grep won't necessarily work with sed , awk , perl , bash's =~ operator, or even grep -E .

On top of that, if you change from single quotes to double quotes, you might then have to add another level of escaping so that bash will expand your string properly.

For example, if you wanted to search for the literal string 'foo [bar]* baz$' using grep , you'd have to escape the [ , * , and $ characters, resulting in the regular expression:

'foo \[bar]\* baz\$'

But if for some reason you decided to pass that expression to grep as a double-quoted string, you would then have to escape the escapes. Otherwise, bash would interpret some of them as escapes. You can see this if you do:

echo "foo \[bar]\* baz\$"
foo \[bar]\* baz$

You can see that bash interpreted \$ as an escape sequence representing the character $ , and thus swallowed the \ character. This is because, in double-quoted strings, $ is normally a special character that begins a parameter expansion. But it left \[ and \* alone because [ and * aren't special inside a double-quoted string, so it treated those backslashes as literal \ characters. To get this expression to work as an argument to grep in a double-quoted string, then, you would have to escape the last backslash:

# This command prints nothing, because bash expands `\$` to just `$`,
# which grep then interprets as an end-of-line anchor.
> echo 'foo [bar]* baz$' | grep "foo \[bar]\* baz\$"

# Escaping the last backslash causes bash to expand `\\$` to `\$`,
# which grep then interprets as matching a literal $ character
> echo 'foo [bar]* baz$' | grep "foo \[bar]\* baz\\$"
foo [bar]* baz$

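But note that an expression escaped this way will not necessarily carry over to other tools or modes. By default both grep and sed use POSIX basic regular expressions, in which characters such as ( , { , + , and ? match themselves when left unescaped, whereas in extended regular expressions ( grep -E , sed -E , awk ) those same characters are generally meta-characters unless escaped. In sed you would additionally have to escape whatever delimiter the s command uses, such as / .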
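To illustrate how the meaning of a character can depend on the regex flavor, here is a small sketch that runs the same pattern through grep's default basic syntax and through grep -E . The sample strings are invented for the demonstration.

```shell
#!/bin/sh
# Two made-up input lines: one containing a literal '+', one without.
input=$(printf '%s\n' 'a+b' 'aab')

# Basic regex (default): an unescaped '+' is an ordinary character.
bre=$(printf '%s\n' "$input" | grep 'a+b')

# Extended regex (-E): '+' means "one or more of the preceding item".
ere=$(printf '%s\n' "$input" | grep -E 'a+b')

echo "basic:    $bre"   # basic:    a+b
echo "extended: $ere"   # extended: aab
```

So an escaping routine written for one flavor can silently change meaning under another.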

So again, yes, you can escape a literal string for use as a grep regular expression. But if you need to match literal strings containing characters that will need to be escaped, it turns out there's a better way: fgrep .

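The fgrep command is really just shorthand for grep -F , where the -F tells grep to match "fixed strings" instead of regular expressions. (Recent GNU grep releases print a warning that fgrep is obsolescent, so grep -F is the safer spelling.) For example: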

> echo '[(*\^]$' | fgrep '[(*\^]$'
[(*\^]$

This works because fgrep doesn't know or care about regular expressions. It's just looking for the exact literal string '[(*\^]$' . However, this sort of puts you back at square one, because fgrep will match on substrings:

> echo '/users/hemanth/dummy' | fgrep '/users/hemanth'
/users/hemanth/dummy

Thankfully, there's a way around this, which it turns out was probably a better approach than my initial answer, considering your specific needs. The -x option to fgrep tells it to only match the entire line. Note that -x is not specific to fgrep (since fgrep is really just grep -F anyway). For example:

> echo '/users/hemanth/dummy' | fgrep -x '/users/hemanth' # prints nothing

This is equivalent to what you would have gotten by escaping the grep regex, and is almost certainly a better answer than my previous answer of enclosing your regex in ^ and $ .
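As a rough side-by-side, the sketch below runs the three spellings discussed so far against the same two lines: explicit anchors, -x with an escaped regex, and -x with a fixed string. The variable names are just for illustration.

```shell
#!/bin/sh
input=$(printf '%s\n' '/user/hemanth.a' '/user/hemanth.a/dummy1')

# 1. Regex with explicit anchors and an escaped dot.
anchored=$(printf '%s\n' "$input" | grep '^/user/hemanth\.a$')

# 2. Regex with -x, which requires the whole line to match.
whole_line=$(printf '%s\n' "$input" | grep -x '/user/hemanth\.a')

# 3. Fixed string with -x: nothing to escape at all.
fixed=$(printf '%s\n' "$input" | grep -Fx '/user/hemanth.a')

# Each variable ends up holding only the first input line.
echo "$anchored"
echo "$whole_line"
echo "$fixed"
```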

Now, as promised, just in case you want to go this route, here's how you would escape a fixed string to use as a grep regex:

# Suppose we want to match the literal string '^foo.\ [bar]* baz$'
# It contains lots of stuff that grep would normally interpret as
# regular expression meta-characters. We need to escape those characters
# so grep will interpret them as literals.
> str='^foo.\ [bar]* baz$'
> echo "$str"
^foo.\ [bar]* baz$

> regex=$(sed -E 's,[.*^$\\[],\\&,g' <<< "$str")
> echo "$regex"
\^foo\.\\ \[bar]\* baz\$

> echo "$str" | grep "$regex"
^foo.\ [bar]* baz$
# Success

Again, for the reasons cited above, I don't recommend this approach, especially not when fgrep -x exists.
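If you do want to compare the two routes, the escaping step can be wrapped in a small function, as in this sketch. The name escape_bre is hypothetical, and the character set is the one from the sed command above, so it only targets grep's default basic regex syntax.

```shell
#!/bin/sh
# escape_bre is a hypothetical helper name. It backslash-escapes the
# characters . * ^ $ \ and [ so that grep's default (basic) regex syntax
# treats them as literals. Other flavors (grep -E, awk, perl) need more.
escape_bre() {
    printf '%s\n' "$1" | sed 's/[.*^$\\[]/\\&/g'
}

str='^foo.\ [bar]* baz$'
regex=$(escape_bre "$str")
echo "$regex"   # \^foo\.\\ \[bar]\* baz\$

# Two candidate lines: the exact string, and the string with a prefix.
candidates=$(printf '%s\n' "$str" "x$str")

# Route 1: escaped regex, whole-line match.
via_regex=$(printf '%s\n' "$candidates" | grep -x -- "$regex")

# Route 2: fixed string, whole-line match, no escaping step needed.
via_fixed=$(printf '%s\n' "$candidates" | grep -Fx -- "$str")

# Both routes select only the exact line.
echo "$via_regex"
echo "$via_fixed"
```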

Read "Anchoring" in man grep :

   Anchoring
       The caret ^ and the dollar sign $ are meta-characters that respectively
       match the empty string at the beginning and end of a line.

Also be aware that . matches any character (from said manual page):

The period . matches any single character.
