
Regex match exact word not preceded or followed by other characters

I'm trying to write a regex that matches a set of words.

For example, suppose the set of words is American Tea.

Then in the string American Tea is awesome. Do you like American Tea? love WowAmerican Tea #American Tea there should be only 2 matches, the first two occurrences:

'American Tea is awesome. Do you like American Tea? love WowAmerican Tea #American Tea'

So I am trying to match only full occurrences of the word set.

I tried some approaches but haven't found the correct regex :( If anyone can help or point me in a direction, it would be really helpful.

Check this

'American Tea lalalal qwqwqw American Tea sdsdsd #American Tea'.match(/(?:^|\s)(American Tea)(?=\s|$)/g)

The result of this is ["American Tea", " American Tea"].

I do not want the space in the second match; I want the result to be ["American Tea", "American Tea"]

(no space in front of the second American Tea)
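The leading space appears because .match() with the g flag returns the full text of each match, and the (?:^|\s) prefix is part of that text. A minimal sketch of one way around it, assuming a runtime that supports String.prototype.matchAll (ES2020): read capture group 1 from each match instead of the full match. The variable names are illustrative only.

```javascript
const text = 'American Tea lalalal qwqwqw American Tea sdsdsd #American Tea';
const pattern = /(?:^|\s)(American Tea)(?=\s|$)/g;

// matchAll yields one match object per hit; index 1 is the first capture group,
// which holds the phrase without the whitespace prefix.
const phrases = Array.from(text.matchAll(pattern), (m) => m[1]);

console.log(phrases); // two entries, both "American Tea"
```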

Use .replace() for fun and profit

/(?:^|\s)(american tea)/ig

https://regex101.com/r/qB0uO2/1

if you want to account for prefixes AND suffixes:

/(?:^|\s)(american tea)(?:\W|$)/ig 

https://regex101.com/r/qB0uO2/2

JSBIN EXAMPLE

var str = "American Tea is awesome. Do you like American Tea? love WowAmerican Tea #American Tea";

str.replace(/(?:^|\s)(american tea)(?:\W|$)/ig, function(i, m){
  console.log(m);
});

//"American Tea"
//"American Tea"

EDIT:

The above returns only the matches. If you instead want to preserve the matched prefixes and suffixes, use capturing groups for them as well:

var str = "American Tea is awesome. Do you like American Tea? love WowAmerican Tea #American Tea";
var newStr = str.replace(/(^|\s)(american tea)(\W|$)/ig, function(im, p1, p2, p3){
  return p1 + "<b>" + p2 + "</b>" + p3; // p1 and p3 will help preserve the pref/suffix
});
document.getElementById("result").innerHTML = newStr;

<div id="result"></div>

where the p arguments are:

  • p1 is the first matching group (any prefix)
  • p2 is the second matching group (the "American Tea" word)
  • p3 is the third matching group (any suffix)
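The prefix/phrase/suffix idea can be sketched as a small reusable function that does not depend on the DOM. The names escapeRegExp and highlightPhrase are hypothetical helpers, not part of any library. One deliberate difference from the regex above: the suffix is checked with a lookahead instead of being captured, so it is not consumed and can still serve as the prefix of an immediately following occurrence.

```javascript
// Hypothetical helper: escape characters that have a special meaning in a regex.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: wrap whole-phrase occurrences in <b> tags.
function highlightPhrase(text, phrase) {
  // Group 1 is the prefix (start of string or whitespace), group 2 is the phrase.
  // The suffix (non-word character or end of string) is only looked at, not consumed.
  const re = new RegExp("(^|\\s)(" + escapeRegExp(phrase) + ")(?=\\W|$)", "gi");
  return text.replace(re, (full, prefix, word) => prefix + "<b>" + word + "</b>");
}

const sample = "American Tea is awesome. Do you like American Tea? love WowAmerican Tea #American Tea";
console.log(highlightPhrase(sample, "american tea"));
```

With the sample string, only the first two occurrences are wrapped; WowAmerican Tea and #American Tea are left alone because the character before them is not whitespace.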

Reading the comments, I realized that a regex might not be the best solution for this. However, it is pretty interesting to see how you can work around the fact that JavaScript (at the time of writing; lookbehind was only added in ES2018) does not support positive lookbehind, which would make this task easy.

If JS had the (?<=...) construct, then you would just use a positive lookbehind and a positive lookahead and list all the characters which you want to allow to the left and right of American Tea. So what we want is something like this:

(?<=\s|\.|,|:|;|\?|\!|^)American Tea(?=\s|\.|,|:|;|\?|\!|$)

To the left, you would allow any of the listed characters and the start of the string ^. To the right, you allow the same characters and the end of the string $.

But on JavaScript engines without the (?<=...) construct (anything predating ES2018), we have to get a little creative:

(?=(\s|\.|,|:|;|\?|\!|^))\1(American Tea)(?=\s|\.|,|:|;|\?|\!|$)

This regex replaces the positive lookbehind with a positive lookahead that captures the prefix in group 1. It then consumes that prefix with the backreference \1, and finally American Tea ends up in capturing group 2.

Demo: https://regex101.com/r/qX9qR3/3
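For comparison, engines that implement ES2018 or later do accept lookbehind, so on such a runtime both forms can be run side by side. This is only a sketch under that assumption (it will throw a SyntaxError on older engines), and the variable names are made up for the example.

```javascript
const input = "American Tea is awesome. Do you like American Tea? love WowAmerican Tea #American Tea";

// Lookbehind form (needs an ES2018+ engine): the full match is the phrase itself.
const withLookbehind = /(?<=\s|\.|,|:|;|\?|!|^)American Tea(?=\s|\.|,|:|;|\?|!|$)/g;
const direct = input.match(withLookbehind);

// Workaround form: group 1 holds the prefix, group 2 holds the phrase.
const workaround = /(?=(\s|\.|,|:|;|\?|!|^))\1(American Tea)(?=\s|\.|,|:|;|\?|!|$)/g;
const viaGroups = Array.from(input.matchAll(workaround), (m) => m[2]);

console.log(direct.length, viaGroups.length); // 2 2
```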

You don't need regexes to match words.

I know a very neat CoffeeScript snippet :

wordList = ["coffeescript", "eko", "talking", "play framework", "and stuff", "falsy"]
tweet = "This is an example tweet talking about javascript and stuff."

wordList.some (word) -> ~tweet.indexOf word # returns true

Which compiles into the following javascript :

var tweet, wordList;

wordList = ["coffeescript", "eko", "talking", "play framework", "and stuff", "falsy"];

tweet = "This is an example tweet talking about javascript and stuff.";

wordList.some(function(word) { // returns true
  return ~tweet.indexOf(word); 
});

~ is not a special operator in CoffeeScript, just a cool trick. It is the bitwise NOT operator, which inverts the bits of its operand; in practice ~x equals -x-1. It works here because indexOf returns -1 when the word is absent, and ~(-1) is -(-1)-1 = 0, which is falsy, while every other index yields a nonzero (truthy) value.
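To make the trick concrete, here is a small sketch of the values involved, followed by the same check written with String.prototype.includes (ES2015), which returns a real boolean. The variable names are illustrative.

```javascript
const tweetText = "This is an example tweet talking about javascript and stuff.";

// indexOf returns -1 when the substring is absent, so ~(-1) is 0 (falsy).
const absent = ~tweetText.indexOf("eko");
// A hit at index 0 gives ~0, which is -1 and therefore still truthy.
const atStart = ~tweetText.indexOf("This");

console.log(absent, atStart); // 0 -1

// The same filter without the bitwise trick.
const wordsToFind = ["coffeescript", "eko", "talking", "play framework", "and stuff", "falsy"];
const present = wordsToFind.filter((word) => tweetText.includes(word));
console.log(present); // the two entries "talking" and "and stuff"
```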

If you want the words that are matched, use :

wordList.filter (word) -> ~tweet.indexOf word # returns : [ "talking", "and stuff" ]

Or the same in JS :

wordList.filter(function(word) { // returns : [ "talking", "and stuff" ]
  return ~tweet.indexOf(word);
});

While Jeremy is of course right, I assume there is more to your problem than visible in your contrived example.

From what it looks like, you're trying to get regular regex word boundaries, except that you consider "#" to be a word character. In that case you can do something like this (where \b means "word boundary"):

(^|[^#])\bAmerican Tea\b

Or, if you simply want to list the characters that you consider non word characters you can do something like this to simulate word boundaries:

(^|[^A-Za-z])American Tea($|[^A-Za-z])
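A quick way to compare the two patterns is to count their matches on the sample string from the question. Note that the second pattern as written treats "#" as a non-letter, so it also accepts #American Tea; the third variant below adds "#" to the character classes, which is an assumption about what the asker wants. The helper countMatches is hypothetical.

```javascript
const sampleText = "American Tea is awesome. Do you like American Tea? love WowAmerican Tea #American Tea";

// Variant 1: \b word boundaries, plus a preceding character that must not be "#".
const hashAware = /(^|[^#])\bAmerican Tea\b/g;
// Variant 2: explicit list of letters; anything else (including "#") counts as a boundary.
const lettersOnly = /(^|[^A-Za-z])American Tea($|[^A-Za-z])/g;
// Variant 3 (assumption): "#" is added to the classes so hashtags are skipped.
const lettersAndHash = /(^|[^A-Za-z#])American Tea($|[^A-Za-z#])/g;

// Hypothetical helper: number of matches of a global regex in the sample text.
const countMatches = (re) => Array.from(sampleText.matchAll(re)).length;

console.log(countMatches(hashAware));      // 2
console.log(countMatches(lettersOnly));    // 3 (the hashtag is matched too)
console.log(countMatches(lettersAndHash)); // 2
```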

You can play around with these, e.g. at http://www.regexr.com/

