
Match vocabulary words and phrases

I am writing an application that takes a vocabulary word or phrase as an input parameter, and I am having trouble writing validation logic for this parameter's value.

These are the rules I've come up with:

  • can be up to 4 words (with hyphens or not)
  • one apostrophe is allowed
  • only regular letters are allowed (no special characters such as ;@#$%^&*()={}[]""?|/>/? ¶ ©, etc.)
  • numbers are disallowed
  • case insensitive
  • multiple languages must be supported (English, Russian, Norwegian, etc.), so non-ASCII Latin letters and Cyrillic letters must both be accepted
  • either whole string matches or nothing

A few examples (in 3 languages):

// match:
one two three four
one-two-three-four
one-two-three four
vær så snill
тест регекс
re-read
under the hood
ONe
rabbit's lair

// not-match:
one two three four five
one two three four@
one-two-three-four five
rabbit"s lair
one' two's
one1
1900

Given the expected results above, could someone point me in the right direction on how to create a validation rule like that? In case it matters, I will be writing the validation logic in C#, so I have more tools than just Regex at my disposal.

In case it helps, I have been testing several patterns, such as ^[\p{Ll}\p{Lt}]+$ and (?=\S*['-])([a-zA-Z'-]+)$. The first seems to do a good job of allowing just the letters I need (English, Norwegian and Russian), whereas the second makes good use of a lookahead.

  • \p{Ll} or \p{Lowercase_Letter} : a lowercase letter that has an uppercase variant.
  • \p{Lu} or \p{Uppercase_Letter} : an uppercase letter that has a lowercase variant.
  • \p{Lt} or \p{Titlecase_Letter} : a letter that appears at the start of a word when only the first letter of the word is capitalized.
  • \p{L&} or \p{Letter&} : a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
  • \p{Lm} or \p{Modifier_Letter} : a special character that is used like a letter.
  • \p{Lo} or \p{Other_Letter} : a letter or ideograph that does not have lowercase and uppercase variants.

Needless to say, neither of the patterns I have been testing takes into account all the rules I defined above.

You can use

\A(?!(?:[^']*'){2})\p{L}+(?:[\s'-]\p{L}+){0,3}\z

See the regex demo. Details:

  • \A - start of string
  • (?!(?:[^']*'){2}) - a negative lookahead: the string cannot contain two apostrophes
  • \p{L}+ - one or more Unicode letters
  • (?:[\s'-]\p{L}+){0,3} - zero to three occurrences of
    • [\s'-] - a whitespace, ' or - char
    • \p{L}+ - one or more Unicode letters
  • \z - the very end of string.
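
The breakdown above can also live inside the pattern itself. The following is a minimal sketch, assuming .NET's RegexOptions.IgnorePatternWhitespace option, which ignores unescaped whitespace in the pattern and treats # as the start of a comment. The class name VocabularyPattern and the method IsValid are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, named here only for illustration.
public static class VocabularyPattern
{
    // Same pattern as above, spread over several lines with inline comments.
    private static readonly Regex Pattern = new Regex(@"
        \A                        # start of string
        (?! (?: [^']* ' ){2} )    # fail if the string holds two apostrophes
        \p{L}+                    # first run of Unicode letters
        (?: [\s'-] \p{L}+ ){0,3}  # up to three more runs, each after one separator
        \z                        # very end of string
        ", RegexOptions.IgnorePatternWhitespace);

    public static bool IsValid(string text) => Pattern.IsMatch(text);

    public static void Main()
    {
        Console.WriteLine(IsValid("one-two-three four")); // True
        Console.WriteLine(IsValid("one' two's"));         // False
    }
}
```

The character classes contain no literal whitespace, so the option does not change what they match.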

In C#, you can use it as

var isValid = Regex.IsMatch(text, @"\A(?!(?:[^']*'){2})\p{L}+(?:[\s'-]\p{L}+){0,3}\z");
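
To check the pattern against the examples from the question, a small console program along these lines could be used. It is only a sketch: the class name VocabularyValidator, the method IsValid and the null check are assumptions added for illustration, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the pattern, named here only for illustration.
public static class VocabularyValidator
{
    private static readonly Regex Pattern =
        new Regex(@"\A(?!(?:[^']*'){2})\p{L}+(?:[\s'-]\p{L}+){0,3}\z");

    // Treating null as invalid is an assumption of this sketch.
    public static bool IsValid(string text) =>
        text != null && Pattern.IsMatch(text);

    public static void Main()
    {
        // Examples taken from the question.
        string[] shouldMatch =
        {
            "one two three four", "one-two-three-four", "one-two-three four",
            "vær så snill", "тест регекс", "re-read", "under the hood",
            "ONe", "rabbit's lair"
        };
        string[] shouldNotMatch =
        {
            "one two three four five", "one two three four@",
            "one-two-three-four five", "rabbit\"s lair", "one' two's",
            "one1", "1900"
        };

        foreach (var s in shouldMatch)
            Console.WriteLine($"{IsValid(s),-5} (expected True)  {s}");
        foreach (var s in shouldNotMatch)
            Console.WriteLine($"{IsValid(s),-5} (expected False) {s}");
    }
}
```

Note that \p{L} matches letters only, not combining marks, so input typed in decomposed form (a base letter followed by a combining accent) would be rejected. Calling text.Normalize() before validating is one way to handle that, if such input is expected.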
