
Add a minimum word length and a way to match capitalized words to a regex

Hi guys, I have the following regex:

/([A-Z][\w-]*(\s+[A-Z][\w-]*)+)/

I've tried different approaches, but I'm not a pro with regex, so this is what I want to do:

  1. Add a rule that matches only words of 3 or more characters.
  2. Add a rule that can match names like "Institute of Technology" (three words, with a lowercase word between the first and the last).

Can you help me do that? (I should probably use separate regexes, am I right?)

To help you understand, this is what you have:

  • [A-Z] : one character in the class A-Z
  • [\w-]* : a run of zero or more word characters or hyphens
  • (...)+ : one or more of:
    • \s+ : at least one whitespace character
    • [A-Z] : one character in the class A-Z
    • [\w-]* : a run of zero or more word characters or hyphens
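To see that breakdown in action, here is a minimal Python sketch that runs the regex from the question over a sample sentence. The sample text and the helper name `find_capitalized_runs` are made up for illustration, and the group is written as non-capturing (`(?:...)`) only to keep the output simple.

```python
import re

# The regex from the question: two or more capitalized words in a row.
CAPITALIZED_RUN = re.compile(r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)+")

def find_capitalized_runs(text):
    """Return every run of two or more consecutive capitalized words."""
    return [m.group(0) for m in CAPITALIZED_RUN.finditer(text)]

sample = "She joined the Massachusetts Institute of Technology in New York."
print(find_capitalized_runs(sample))
# ['Massachusetts Institute', 'New York']
```

Note how the lowercase "of" ends the first run, which is exactly why "Institute of Technology" is not matched as a whole.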

This is what you want:

  • [A-Z] : a capital letter
  • [\w-]* : a run of zero or more word characters or hyphens
  • \s+ : at least one whitespace character
  • [a-z] : a lowercase letter
  • [\w-]* : a run of zero or more word characters or hyphens
  • \s+ : at least one whitespace character
  • [A-Z] : a capital letter
  • [\w-]* : a run of zero or more word characters or hyphens

That is:

[A-Z][\w-]*\s+[a-z][\w-]*\s+[A-Z][\w-]*
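A quick way to check this pattern is to run it over a sentence. This is only a sketch in Python; the sample sentence and the function name `find_cap_lower_cap` are invented for the example.

```python
import re

# Capitalized word, lowercase word, capitalized word (the pattern above).
CAP_LOWER_CAP = re.compile(r"[A-Z][\w-]*\s+[a-z][\w-]*\s+[A-Z][\w-]*")

def find_cap_lower_cap(text):
    """Return all 'Xxx yyy Zzz' style phrases found in the text."""
    return CAP_LOWER_CAP.findall(text)

sentence = "He studied at the Institute of Technology and the University of Oxford."
print(find_cap_lower_cap(sentence))
# ['Institute of Technology', 'University of Oxford']
```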

You may want to make some small changes. I think you can do them on your own.


A rule that matches only words of 3 or more characters is \w{3,} . If you also want the first character to be a capital letter, use [A-Z]\w{2,} .
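Here is a small Python sketch of both rules. The `\b` word boundaries are my own addition (an assumption, not part of the patterns above); without them the capitalized version could start in the middle of a word such as "iPhone". The sample text is made up.

```python
import re

# Any word of 3+ word characters.
ANY_3PLUS = re.compile(r"\b\w{3,}\b")
# A capital letter followed by 2+ word characters (3+ in total).
CAP_3PLUS = re.compile(r"\b[A-Z]\w{2,}\b")

text = "An Ox met Bob and Alice at MIT"
print(ANY_3PLUS.findall(text))  # ['met', 'Bob', 'and', 'Alice', 'MIT']
print(CAP_3PLUS.findall(text))  # ['Bob', 'Alice', 'MIT']
```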

(\w\w\w+)|(\w+ [a-z]+ \w+) - This pattern matches either a word of at least 3 word characters OR a three-part phrase: one or more word characters, a space, one or more lowercase letters, a space, and one or more word characters. You can replace \w with [A-Z] if necessary. If the first and last words of your three-word phrase must be capitalized, change the second group to ([A-Z]\w* [a-z]+ [A-Z]\w*) . Try it here: https://regex101.com/r/E3IPTj/1
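One thing worth checking with an alternation like this is the order of the branches, since most backtracking engines try the left branch first. A rough Python sketch (the helper `matches` and the swapped pattern are my own, just for comparison):

```python
import re

phrase = "Institute of Technology"

# The alternation as written above: single-word branch first.
as_written = re.compile(r"(\w\w\w+)|(\w+ [a-z]+ \w+)")
# Same two branches, but with the three-word branch tried first.
phrase_first = re.compile(r"(\w+ [a-z]+ \w+)|(\w\w\w+)")

def matches(pattern, text):
    """Return the full text of every match, in order."""
    return [m.group(0) for m in pattern.finditer(text)]

print(matches(as_written, phrase))    # ['Institute', 'Technology']
print(matches(phrase_first, phrase))  # ['Institute of Technology']
```

So if the whole phrase is what you are after, putting the longer branch first (or using two separate regexes) may be the safer choice.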

I'm not sure about the scope of your requirements, but a few 'building blocks' might help. I'd also suggest starting with the basics. I don't know of any recent websites that teach regex well, but when I started I used http://www.regular-expressions.info/tutorial.html (it's been many years, and the website does show its age, so to speak).

However onto your regex:

Following your example: Institute of Technology

You need to know just a few things: character sets (and how to control the match length) and the space.

A character set matches exactly one character by default. It is written like [abc], which matches a, b, or c, and it also supports ranges (e.g. [a-z]) and shorthand classes (e.g. \d for all digits). The match length can be changed with a quantifier:

  • + - one or more (examples: a+, [abc]+, \d+)
  • * - zero or more (examples: a*, [abc]*)

And this one you might want, but that's up to you:

  • {min,max} - specific range, e.g. b{3,5} will match 3-5 consecutive 'b' characters (bbb, bbbb, bbbbb). max can be omitted ( {min,} ) to require at least min characters with no upper limit
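If you want to play with these quantifiers, a small Python sketch like the one below can help. The helper `whole_match` is a name I made up; it uses `re.fullmatch` so that each pattern has to cover the entire test string.

```python
import re

def whole_match(pattern, text):
    """True if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None

checks = [
    (r"a+", ""),          # + needs at least one 'a'
    (r"a*", ""),          # * is happy with zero
    (r"[abc]+", "cab"),   # one or more characters from the set
    (r"\d+", "2024"),     # one or more digits
    (r"b{3,5}", "bb"),    # too short
    (r"b{3,5}", "bbbb"),  # within 3-5
    (r"b{3,5}", "bbbbbb"),  # too long
    (r"b{3,}", "bbbbbbbbbb"),  # at least 3, no upper limit
]

for pattern, text in checks:
    print(f"{pattern!r:12} vs {text!r:14} -> {whole_match(pattern, text)}")
```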

Spaces are matched with " " (a literal space). \s matches any whitespace character (equal to [\r\n\t\f\v ] ): spaces, tabs, newlines, ...
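The difference between a literal space and \s shows up as soon as the input contains tabs or doubled spaces. A minimal Python sketch, with a made-up sample string and three-word patterns chosen only for illustration:

```python
import re

# Words separated by exactly one literal space.
literal_space = re.compile(r"[A-Za-z]+ [A-Za-z]+ [A-Za-z]+")
# Words separated by one or more whitespace characters of any kind.
any_whitespace = re.compile(r"[A-Za-z]+\s+[A-Za-z]+\s+[A-Za-z]+")

messy = "Institute\tof  Technology"  # a tab, then two spaces

print(literal_space.search(messy))            # None
print(any_whitespace.search(messy).group(0))  # the whole string, tab included
```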

In your example it's a matter of whether the matching is case sensitive or not. If it is not case sensitive, we can use a simple [A-Za-z]+ to match one or more upper- or lowercase letters a-z; together with the space we get something along the lines of

/[A-Za-z]+ [A-Za-z]+ [A-Za-z]+/

It's that simple. For case-insensitive matching there is also an option flag, i, which results in

/[a-z]+ [a-z]+ [a-z]+/i
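In Python the `/.../i` literal syntax does not exist; as far as I know the equivalent is passing `re.IGNORECASE` (or `re.I`). A small sketch comparing the explicit character class with the flagged version on a few made-up ASCII samples:

```python
import re

# Explicit upper- and lowercase ranges.
explicit = re.compile(r"[A-Za-z]+ [A-Za-z]+ [A-Za-z]+")
# Lowercase ranges plus the case-insensitive flag (the "i" in /.../i).
flagged = re.compile(r"[a-z]+ [a-z]+ [a-z]+", re.IGNORECASE)

samples = ["Institute of Technology", "institute OF technology", "ONE TWO THREE"]
for text in samples:
    a = explicit.search(text)
    b = flagged.search(text)
    print(text, "->", a.group(0) == b.group(0))
```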

If you do want case-sensitive matching, you will need to separate the upper- and lowercase parts however you like:

/[A-Z][a-z]* [a-z]+ [A-Z][a-z]*/ // (*A a A*)

As a small change I've also changed + into * after the capital letter, so the lowercase part of a capitalized word is not required (a single capital letter also matches). Again, that's up to you.

Also note that to match the beginning of a string you're required to use ^, and to match the end of a string, $. The unanchored examples above will match any segment, not the whole input, e.g. qhg8Institute of Technology8tghagus would work.

So final result:

/^[A-Z][a-z]* [a-z]+ [A-Z][a-z]*$/ // case sensitive (Aa a Aa)
/^[a-z]+ [a-z]+ [a-z]+$/i          // case insensitive
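To see what the anchors change, here is a short Python sketch using the noisy example string from above. One Python-specific caveat: `$` also matches just before a trailing newline, so `re.fullmatch` is shown as a stricter alternative (that part is my addition, not something from the patterns above).

```python
import re

unanchored = re.compile(r"[A-Z][a-z]* [a-z]+ [A-Z][a-z]*")
anchored = re.compile(r"^[A-Z][a-z]* [a-z]+ [A-Z][a-z]*$")

noisy = "qhg8Institute of Technology8tghagus"
clean = "Institute of Technology"

print(unanchored.search(noisy).group(0))  # Institute of Technology
print(anchored.search(noisy))             # None
print(anchored.search(clean).group(0))    # Institute of Technology

# In Python, $ tolerates one trailing newline; fullmatch does not.
print(anchored.search(clean + "\n") is not None)        # True
print(unanchored.fullmatch(clean + "\n") is not None)   # False
```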

Obviously there is a lot more to learn that can be used to expand or optimize this, but regexes are so customizable that it's really up to the person needing them to specify their limitations and requirements.


As a side note, I noticed people using \w for word characters, but this also includes digits, _, and (depending on the regex engine and its flags) non-ASCII letters like à, ü, etc. Again, it's up to you what to do with this.
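What \w accepts really does depend on the engine. The sketch below shows the behaviour in Python 3, where \w on str patterns is Unicode-aware by default and `re.ASCII` restricts it; other engines may behave differently, so treat this as an illustration only. The test words are made up.

```python
import re

word_chars = re.compile(r"[A-Z]\w*")              # \w: letters, digits, underscore
letters_only = re.compile(r"[A-Z][A-Za-z]*")      # ASCII letters only
ascii_word = re.compile(r"[A-Z]\w*", re.ASCII)    # \w limited to [a-zA-Z0-9_]

for word in ["Zurich", "Tech_2024", "Zürich"]:
    print(
        word,
        word_chars.fullmatch(word) is not None,
        letters_only.fullmatch(word) is not None,
        ascii_word.fullmatch(word) is not None,
    )
```

If names should contain letters only, an explicit class such as [A-Za-z] (extended with whatever accented letters you expect) is probably the more predictable choice.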
