Regex to match words with first capital letter
I am trying to identify structure in my text data using a regex and am hitting roadblocks.
For the sample text below:
I AM A HEADER:
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s.
I AM A TAB- Lorem Ipsum is simply dummy text of the printing
My regular expression below picks up 'I AM A HEADER:' and 'I AM A TAB-':
^\s*(?:\b[A-Z]+\b[\s]*)+(?:[:-])\s*$
Please suggest an edit so that it matches 'I Am A Header' and 'I Am A Tab' and also ignores the end markers ':' and '-'.
You can use:
^\s*(?:\b[a-zA-Z]+\b\s*)+(?=[:-])
See regex demo
Regex breakdown:
^ - start of string
\s* - 0 or more whitespace characters
(?:\b[a-zA-Z]+\b\s*)+ - 1 or more sequences of:
    \b - word boundary (redundant)
    [a-zA-Z]+ - 1 or more letters
    \b\s* - word boundary, then 0 or more whitespace characters
(?=[:-]) - a lookahead requiring a : or - to be right after the preceding subpattern
The main points here are adding [a-z] to the [A-Z] range, removing \s*$, and turning the trailing (?:...) non-capturing group into a look-ahead (which does not consume characters).
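The thread does not name a language, so the following is only a minimal sketch in Python of how the pattern behaves on text shaped like the question's sample. The `re.MULTILINE` flag, the names `HEADER_RE` and `sample`, and the shortened sample text are assumptions added for illustration; without that flag, `^` would anchor only at the very start of the string.

```python
import re

# Pattern taken from the answer. re.MULTILINE is an assumption here so that
# ^ can anchor at the start of every line, not only at the start of the string.
HEADER_RE = re.compile(r"^\s*(?:\b[a-zA-Z]+\b\s*)+(?=[:-])", re.MULTILINE)

# Shortened, illustrative version of the sample text from the question.
sample = (
    "I AM A HEADER:\n"
    "Lorem Ipsum is simply dummy text of the printing and typesetting industry.\n"
    "I Am A Tab- Lorem Ipsum is simply dummy text of the printing\n"
)

# The lookahead checks for ':' or '-' without consuming it, so the marker
# is not part of the matched text.
matches = [m.group(0).strip() for m in HEADER_RE.finditer(sample)]
print(matches)
```

Because `\s` also matches newlines, a match can in principle run across a line break when a line consists only of words; splitting the text into lines first is one way to avoid that.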
Here's @stribizhev's solution, which worked like a charm.
^\s*(?:\b[a-zA-Z]+\b\s*)+(?=[:-])
For newbies like me, this is a simple explanation of the solution:
> ^ Anchor at the start of the string
> \s* Match any leading white space (tab, newline, blank space), if present
> (?: Start a non-capturing group
> \b Assert a word boundary
> [a-zA-Z] Match a letter, either capital or small
> + Repeat, to match the remaining letters of the word
> \s* Match any white space after the word
> )+ Repeat the group for each following word
> (?= Start a lookahead: check what follows without including it in the match
> [:-] Look for ':' or '-'
In essence, this regular expression looks for a group of words at the start of the text (or of each line, if multiline mode is enabled) followed by ':' or '-'.
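The same explanation can be kept next to the pattern itself. The sketch below assumes Python, whose `re.VERBOSE` flag allows whitespace and `#` comments inside a pattern; the names `HEADER_VERBOSE` and `text` are hypothetical, and `re.MULTILINE` is an added assumption rather than part of the original answer.

```python
import re

# Same pattern as above, written with inline comments via re.VERBOSE.
HEADER_VERBOSE = re.compile(
    r"""
    ^                    # start of the string (or of a line, with re.MULTILINE)
    \s*                  # optional leading whitespace
    (?:                  # non-capturing group for a single word
        \b[a-zA-Z]+\b    # a run of letters between word boundaries
        \s*              # optional whitespace after the word
    )+                   # one or more words
    (?=[:-])             # lookahead: the next character is ':' or '-', not consumed
    """,
    re.VERBOSE | re.MULTILINE,
)

text = "  I Am A Header: body text"
m = HEADER_VERBOSE.search(text)
header = m.group(0).strip()      # matched words without the leading spaces
next_char = text[m.end()]        # the end marker stays outside the match
print(header, repr(next_char))
```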
You can make the expression stricter by capping how many words a header may contain, since beyond a certain length a match is unlikely to be relevant. Add a {n,m} quantifier, for example:
^[\s]*(?:\b[a-zA-Z]+\b[\s]*){1,3}(?=[:-])
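As a quick check of the capped version, here is a small sketch, again assuming Python; `CAPPED_RE` and the two test strings are made up for illustration. Note that a header longer than three words is rejected outright rather than truncated, because the lookahead must succeed immediately after the last counted word.

```python
import re

# Capped variant from the post: between one and three words before the marker.
CAPPED_RE = re.compile(r"^[\s]*(?:\b[a-zA-Z]+\b[\s]*){1,3}(?=[:-])")

short_match = CAPPED_RE.search("Short Header: body")   # two words
long_match = CAPPED_RE.search("I Am A Header: body")   # four words

print(short_match is not None)  # the two-word header fits within {1,3}
print(long_match is not None)   # the four-word header exceeds the cap
```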