I am trying to identify structure in my text data using a regex and am hitting roadblocks.
For the sample text below:
I AM A HEADER:
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s.
I AM A TAB- Lorem Ipsum is simply dummy text of the printing
My regular expression below picks up 'I AM A HEADER:' and 'I AM A TAB-':
^\s*(?:\b[A-Z]+\b[\s]*)+(?:[:-])\s*$
Please suggest an edit so that it also matches mixed-case text such as 'I Am A Header' and 'I Am A Tab', and leaves the end markers ':' and '-' out of the match.
You can use
^\s*(?:\b[a-zA-Z]+\b\s*)+(?=[:-])
See regex demo
Regex breakdown:
- ^ - start of string
- \s* - 0 or more whitespace characters
- (?:\b[a-zA-Z]+\b\s*)+ - 1 or more sequences of:
  - \b - a word boundary (redundant)
  - [a-zA-Z]+ - 1 or more letters
  - \b\s* - a word boundary followed by 0 or more whitespace characters
- (?=[:-]) - a lookahead requiring a : or - to be right after the preceding subpattern

The main points here are adding [a-z] to the [A-Z] range, removing \s*$, and turning the (?:...) non-capturing group into a lookahead (which does not consume characters).
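As a rough illustration of the pattern above, here is a minimal Python sketch. The `re.MULTILINE` flag, the `find_headings` helper, and the sample string (with the tab heading placed on its own line) are assumptions made for the example, not part of the original question.

```python
import re

# Pattern from the answer above. re.MULTILINE is an assumption so that ^
# matches at the start of every line, not only at the start of the string.
HEADING = re.compile(r"^\s*(?:\b[a-zA-Z]+\b\s*)+(?=[:-])", re.MULTILINE)

# Hypothetical sample text; the second heading starts its own line.
sample = (
    "I AM A HEADER:\n"
    "Lorem Ipsum is simply dummy text of the printing industry.\n"
    "I Am A Tab- Lorem Ipsum is simply dummy text of the printing\n"
)

def find_headings(text):
    """Return runs of words at a line start that are followed by ':' or '-'."""
    return [m.group(0).strip() for m in HEADING.finditer(text)]

print(find_headings(sample))  # ['I AM A HEADER', 'I Am A Tab']
```

Note that \s also matches newlines, so a run of words may span several lines if no punctuation interrupts it.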
Here's @stribizhev's solution, which worked like a charm.
^\s*(?:\b[a-zA-Z]+\b\s*)+(?=[:-])
For newbies like me this is a simple explanation of the solution:
> ^ Anchor to the start of the string (or of each line in multiline mode)
> \s* Match any leading whitespace (tab, newline, space)
> (?: Start a non-capturing group
> \b Assert a word boundary
> [a-zA-Z]+ Match one or more uppercase or lowercase letters
> \b Assert a word boundary again at the end of the word
> \s* Match any whitespace after the word
> )+ Close the group and repeat it one or more times
> (?= Start a lookahead: what follows must be present but is left out of the match
> [:-] Look for ':' or '-'
> ) Close the lookahead
In essence, this regular expression looks for a group of words at the start of a line that is followed by ':' or '-'.
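To see what the lookahead actually does, the following Python sketch compares it with a variant that consumes the delimiter. The sample line and the variable names are made up for illustration.

```python
import re

# Hypothetical input line.
line = "I Am A Header: some body text"

# Consuming variant: the delimiter becomes part of the match.
consuming = re.match(r"^\s*(?:\b[a-zA-Z]+\b\s*)+[:-]", line)

# Lookahead variant: the delimiter must follow, but is left out of the match.
lookahead = re.match(r"^\s*(?:\b[a-zA-Z]+\b\s*)+(?=[:-])", line)

print(repr(consuming.group(0)))  # 'I Am A Header:'
print(repr(lookahead.group(0)))  # 'I Am A Header'
print(line[lookahead.end()])     # ':' is still unconsumed, right after the match
```

So the lookahead does not ignore what follows: it checks that ':' or '-' is there and simply does not include it in the result.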
You can make the above expression smarter by telling it how many words a heading may contain before you start losing relevant information. Replace + with a {n,m} quantifier (no space after the comma), for example:
^[\s]*(?:\b[a-zA-Z]+\b[\s]*){1,3}(?=[:-])
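A small Python sketch of the effect of the word limit, assuming `re.MULTILINE` and two made-up input lines (the helper name `headings` is also hypothetical):

```python
import re

# Unbounded and bounded variants of the pattern; re.MULTILINE is an assumption.
ANY_LENGTH = re.compile(r"^\s*(?:\b[a-zA-Z]+\b\s*)+(?=[:-])", re.MULTILINE)
MAX_THREE = re.compile(r"^\s*(?:\b[a-zA-Z]+\b\s*){1,3}(?=[:-])", re.MULTILINE)

# Hypothetical lines: a short heading and an ordinary sentence ending in ':'.
text = (
    "Short Header:\n"
    "This is an ordinary sentence that happens to end with a colon:\n"
)

def headings(pattern, s):
    """Return the stripped matches of pattern in s."""
    return [m.group(0).strip() for m in pattern.finditer(s)]

print(headings(ANY_LENGTH, text))
# ['Short Header', 'This is an ordinary sentence that happens to end with a colon']
print(headings(MAX_THREE, text))
# ['Short Header']
```

With {1,3}, a line only matches when at most three words stand between the line start and the ':' or '-', so long sentences that merely end in a colon are skipped.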