
C# regex match only parts of complete words in string

Before asking this question I have Googled for this problem and I have looked through all StackOverflow related questions.

The problem is pretty simple

I have a string "North Atlantic Treaty Organization"

I have a pattern "a.*z"; at the moment it matches

north ATLANTIC TREATY ORGANIZation

But I need it to match complete words only (orgANIZation for example)

I have tried "\\ba z\\b" and "\\Ba z\\B" as patterns, but I don't think I quite understand how they work.

How should I change my pattern so that it matches only inside complete words that the string contains (without matching across multiple words)?

The patterns are generated on the fly: the user enters a*z and my application translates it into a pattern that matches parts of complete words in the string.

My problem is that I don't know what the user is going to search for. Ideally I would prepend some regexp to the user's expression.

Thank You!

ANIZ in orgANIZation is not a complete word; it's part of a word. Your pattern, by the way, cannot be what you wrote: a*z would not match as you describe, so you're probably using a.*z instead, which would. Try a[^ ]*z so that the match cannot include spaces. If there are other characters besides spaces that you don't want to match, e.g. some kinds of punctuation, add them to the [^...] construct as well.

"a[^\s]*z"

This means an 'a' followed by any number of non-whitespace characters, followed by a 'z'.
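To see the difference, here is a minimal sketch that runs both patterns against the string from the question. The class and method names (GreedyDemo, FirstMatch) are made up for illustration, and matching is case-sensitive here, which may differ from what the asker's application does.

```csharp
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Hypothetical helper: returns the text of the first match,
    // or an empty string when the pattern does not match at all.
    public static string FirstMatch(string text, string pattern)
    {
        return Regex.Match(text, pattern).Value;
    }
}
```

With "North Atlantic Treaty Organization", FirstMatch(text, "a.*z") gives "antic Treaty Organiz", because ".*" happily runs across the spaces. FirstMatch(text, @"a[^\s]*z") gives "aniz", since the character class cannot cross whitespace.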

EDIT: You seem to want '*' to be interpreted as a wildcard character. The user is thus not to enter a regex, but a string with certain wildcards. You can translate these wildcard characters to regex by reasoning over the intended meaning. Let's say that '*' should mean "zero or more characters that are not whitespace". You replace this character, then, with the corresponding regex:

                       [^\s]*
                       `-.-´|
     Character class-----´  `---Zero or more of these

     '\s': "Whitespace"
     Inside Character class: if it starts with '^': "not"

You might also want to define '?' as matching exactly a single non-whitespace character. This is the same character class, but you omit the '*' at the end.

So, what you do is regex-replace "*" with "[^\\s]*" and "?" with "[^\\s]".
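A rough sketch of that translation in C# might look like the following. WildcardPattern and ToRegex are hypothetical names, and escaping every other character with Regex.Escape is my own assumption (so that a user typing "." or "(" gets a literal match), not something the question asks for.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class WildcardPattern
{
    // Hypothetical helper: turns a wildcard string such as "a*z"
    // into a regex fragment.
    public static string ToRegex(string wildcard)
    {
        var sb = new StringBuilder();
        foreach (char c in wildcard)
        {
            if (c == '*')
                sb.Append(@"[^\s]*");   // zero or more non-whitespace characters
            else if (c == '?')
                sb.Append(@"[^\s]");    // exactly one non-whitespace character
            else
                sb.Append(Regex.Escape(c.ToString())); // assumption: everything else is literal
        }
        return sb.ToString();
    }
}
```

For example, WildcardPattern.ToRegex("a*z") produces a[^\s]*z, and the result can be passed straight to new Regex(...).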

This is what you are looking for:

new Regex( @"\b[^ ]*a[^ ]*z[^ ]*\b" );

It matches only a single word (no spaces are allowed), but it matches the whole word. You can translate your user's input into such a regex: just replace * with [^ ]*. It works even with more than one wildcard.
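Put together, a sketch of this approach could look like the code below. WholeWordSearch and FindWords are invented names, and the Regex.Escape step is an extra precaution I am assuming you would want, so that only * acts as a wildcard.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WholeWordSearch
{
    // Hypothetical helper: returns every whole word in 'text' that
    // contains the wildcard expression, e.g. "a*z".
    public static List<string> FindWords(string text, string wildcard)
    {
        // Escape the user's input, then turn the escaped "\*" into "[^ ]*".
        string core = Regex.Escape(wildcard).Replace(@"\*", "[^ ]*");

        // Let the match grow to the word boundaries on both sides.
        string pattern = @"\b[^ ]*" + core + @"[^ ]*\b";

        var words = new List<string>();
        foreach (Match m in Regex.Matches(text, pattern))
        {
            words.Add(m.Value);
        }
        return words;
    }
}
```

Calling FindWords("North Atlantic Treaty Organization", "a*z") returns a list with the single entry "Organization".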

Not related to your question directly, but you may want to check out a RegEx visualization tool which shows you the captured results based on text input and a given regular expression.

Such a tool is very helpful for finding the right pattern, which can be quite tricky. A nice tool specialized for .NET RegEx is RegExLab, a bit older but it does a good job of showing what exactly your regular expression matches. Since the page is in German, just click on the regexlab.006.zip link. Source code is also included.

Regex reWord = new Regex("\\b[A-Za-z]*?(a.*z)[A-Za-z]*\\b");

... this will return "Atlantic Treaty Organization", with the capture from a.*z being "antic Treaty Organiz".

The problem is inherent in your method: unless you parse the user-supplied "regex" of a*z (or a.*z, that's not quite clear from your post) by modifying * to [^\\s]*? as Svante suggests (or perhaps \\w*?), you're going to gobble up far more characters than you like.

".*" is, generally speaking, a bad idea when you're trying to be specific. It'll match everything but a newline, and there's nothing you can append to it that will stop that.

Regex reWord = new Regex("\\b\\w*?(a\\w*?z)\\w*\\b");

...will return just "Organization".
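If you need both the whole word and the part the user's expression actually hit, the capture group gives you that. A small sketch, with CaptureDemo and WordAndPart being names I made up:

```csharp
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Hypothetical helper: returns { whole word, captured part } for the
    // first match, or null when nothing matches.
    public static string[] WordAndPart(string text)
    {
        Match m = Regex.Match(text, @"\b\w*?(a\w*?z)\w*\b");
        if (!m.Success)
        {
            return null;
        }
        // m.Value is the complete word; Groups[1] is the a...z part inside it.
        return new[] { m.Value, m.Groups[1].Value };
    }
}
```

For "North Atlantic Treaty Organization" this yields "Organization" as the word and "aniz" as the captured part.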

Alternatively, if you absolutely must, for whatever reason, avoid modifying the user-supplied regex, perhaps try splitting your string into an array of words and testing each word individually against the regex.
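The split-and-test idea could be sketched like this. PerWordSearch and WordsMatching are hypothetical names, and splitting on single spaces only is an assumption; real input may need tabs or punctuation handled as well.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PerWordSearch
{
    // Hypothetical helper: applies the unmodified user pattern to each
    // word separately, so a match can never span two words.
    public static List<string> WordsMatching(string text, string userPattern)
    {
        var regex = new Regex(userPattern);
        var hits = new List<string>();

        // Assumption: words are separated by spaces only.
        string[] words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string word in words)
        {
            if (regex.IsMatch(word))
            {
                hits.Add(word);
            }
        }
        return hits;
    }
}
```

Here even the greedy "a.*z" is harmless: WordsMatching("North Atlantic Treaty Organization", "a.*z") returns only "Organization".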

Ultimately, it's GIGO - garbage in, garbage out. Feed your system a bad regex and if you don't fix it appropriately, you won't get what you're looking for.
