Regex to match words with first capital letter

I'm trying to identify structure in my text data using a regex and am hitting roadblocks.

For the sample text below

I AM A HEADER:
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s.

I AM A TAB- Lorem Ipsum is simply dummy text of the printing

My regular expression below picks up 'I AM A HEADER:' and 'I AM A TAB-'

^\s*(?:\b[A-Z]+\b[\s]*)+(?:[:-])\s*$

Please suggest an edit so as to match 'I Am A Header' and 'I Am A Tab' and also ignore the end-markers ':' and '-'.

You can use

^\s*(?:\b[a-zA-Z]+\b\s*)+(?=[:-])

See regex demo

Regex breakdown:

  • ^ - start of string (or start of a line when the multiline flag is set)
  • \s* - 0 or more whitespace characters
  • (?:\b[a-zA-Z]+\b\s*)+ - 1 or more sequences of
    • \b - word boundary (redundant)
    • [a-zA-Z]+ - 1 or more letters
    • \b\s* - a word boundary followed by 0 or more whitespace characters
  • (?=[:-]) - a lookahead requiring a : or - to be right after the preceding subpattern

The main points here are adding [a-z] to the [A-Z] range, removing \s*$, and turning the (?:[:-]) non-capturing group into a lookahead (which does not consume characters).
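As a rough illustration of those three changes, here is a small sketch that runs the question's pattern and the revised one side by side. Python's re module and the helper name matched are my own choices for the demo, not part of the original answer; other regex flavors should behave the same way for these patterns.

```python
import re

# The question's pattern and the answer's revision, copied verbatim.
ORIGINAL = re.compile(r"^\s*(?:\b[A-Z]+\b[\s]*)+(?:[:-])\s*$")
REVISED = re.compile(r"^\s*(?:\b[a-zA-Z]+\b\s*)+(?=[:-])")

def matched(pattern, line):
    """Return the matched text, or None when the pattern does not match."""
    m = pattern.match(line)
    return m.group() if m else None

print(matched(ORIGINAL, "I AM A HEADER:"))           # I AM A HEADER:  (marker consumed)
print(matched(REVISED, "I AM A HEADER:"))            # I AM A HEADER   (marker only asserted)
print(matched(ORIGINAL, "I Am A Header:"))           # None  ([A-Z]+ rejects lowercase letters)
print(matched(REVISED, "I Am A Header:"))            # I Am A Header
print(matched(ORIGINAL, "I AM A TAB- Lorem Ipsum"))  # None  (\s*$ forbids text after the marker)
print(matched(REVISED, "I AM A TAB- Lorem Ipsum"))   # I AM A TAB
```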

Here's @stribizhev's solution, which worked like a charm.

^\s*(?:\b[a-zA-Z]+\b\s*)+(?=[:-])

For newbies like me this is a simple explanation of the solution:

> ^        Anchor to the start of the string (or of a line, in multiline mode)
> \s*      Match zero or more whitespace characters (tab, newline, space)
> (?:      Start a non-capturing group
> \b       Assert a word boundary (the start of a word)
> [a-zA-Z] Match any letter, uppercase or lowercase
> +        Repeat the letter class one or more times
> \b       Assert a word boundary (the end of the word)
> \s*      Match any whitespace after the word
> )+       Close the group and repeat it one or more times
> (?=      Start a lookahead: check what follows without including it in the match
> [:-]     Look for ':' or '-'
> )        Close the lookahead

In essence, this regular expression looks for a group of words at the start of a line that is followed by ':' or '-'.
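To try it on the sample text from the question, here is a minimal sketch. The use of Python, the re.MULTILINE flag (so that ^ matches at every line start), and the helper name find_headings are assumptions on my part; the question does not say which language or flags are in use.

```python
import re

# Pattern from the answer; re.MULTILINE lets ^ match at the start of every line.
HEADING = re.compile(r"^\s*(?:\b[a-zA-Z]+\b\s*)+(?=[:-])", re.MULTILINE)

SAMPLE = (
    "I AM A HEADER:\n"
    "Lorem Ipsum is simply dummy text of the printing and typesetting industry. "
    "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s.\n"
    "\n"
    "I AM A TAB- Lorem Ipsum is simply dummy text of the printing\n"
)

def find_headings(text):
    # The leading \s* can absorb a preceding blank line, so strip each match.
    return [m.group().strip() for m in HEADING.finditer(text)]

print(find_headings(SAMPLE))  # ['I AM A HEADER', 'I AM A TAB']
```

Note that the leading \s* also matches newlines, which is why the second match would start with the blank line above it if it were not stripped.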

You can make the expression stricter by capping the number of words a heading may contain, since a longer run of words is unlikely to be a heading. Replace the + after the group with an {n,m} quantifier, for example:

^[\s]*(?:\b[a-zA-Z]+\b[\s]*){1,3}(?=[:-])
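One thing to watch with the word cap: both sample headings in the question are four words long, so a limit of 3 would reject them. The sketch below shows this; it assumes Python, and heading_pattern, heading and max_words are hypothetical names used only for this demo.

```python
import re

def heading_pattern(max_words):
    # The quantifier braces are doubled so the f-string emits e.g. {1,3}.
    return re.compile(rf"^[\s]*(?:\b[a-zA-Z]+\b[\s]*){{1,{max_words}}}(?=[:-])")

def heading(line, max_words):
    """Return the heading text if the line starts with 1..max_words words
    followed by ':' or '-', else None."""
    m = heading_pattern(max_words).match(line)
    return m.group().strip() if m else None

print(heading("SHORT TITLE: some text", 3))  # SHORT TITLE
print(heading("I AM A HEADER:", 3))          # None (four words exceed the cap)
print(heading("I AM A HEADER:", 4))          # I AM A HEADER
```

Pick the upper bound from your own data: it should be at least as large as the longest heading you expect.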
