Regex to match words with first capital letter

I'm trying to identify structure in my text data using a regex and am hitting roadblocks.

For the sample text below:

I AM A HEADER:
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s.

I AM A TAB- Lorem Ipsum is simply dummy text of the printing

My regular expression below picks up 'I AM A HEADER:' and 'I AM A TAB-':

^\s*(?:\b[A-Z]+\b[\s]*)+(?:[:-])\s*$

Please suggest an edit so that it also matches 'I Am A Header' and 'I Am A Tab' and leaves the end markers ':' and '-' out of the match.

You can use

^\s*(?:\b[a-zA-Z]+\b\s*)+(?=[:-])

See regex demo
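As a rough way to try the pattern offline, the sketch below applies it line by line to the sample text from the question. Python's `re` module and the names `HEADER_RE`, `SAMPLE`, and `find_headers` are assumptions made for illustration; the original post does not name a language. The last sample line is an added mixed-case example, not part of the question.

```python
import re

# The pattern from the answer, unchanged.
HEADER_RE = re.compile(r'^\s*(?:\b[a-zA-Z]+\b\s*)+(?=[:-])')

# Sample text from the question; the final line is a hypothetical
# mixed-case header added for illustration.
SAMPLE = """I AM A HEADER:
Lorem Ipsum is simply dummy text of the printing and typesetting industry.

I AM A TAB- Lorem Ipsum is simply dummy text of the printing
I Am A Header: mixed case works too
"""

def find_headers(text):
    """Return the header-like prefix of every line that has one."""
    headers = []
    for line in text.splitlines():
        # match() anchors at the start of the line, like ^ in the pattern.
        m = HEADER_RE.match(line)
        if m:
            # strip() drops any whitespace the pattern matched around the words.
            headers.append(m.group().strip())
    return headers

print(find_headers(SAMPLE))
# ['I AM A HEADER', 'I AM A TAB', 'I Am A Header']
```

Working line by line is a deliberate choice here: `\s` also matches newlines, so running the pattern over the whole text with `re.MULTILINE` can pull a preceding blank line into a match.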

Regex breakdown:

  • ^ - start of string
  • \s* - 0 or more whitespace characters
  • (?:\b[a-zA-Z]+\b\s*)+ - 1 or more sequences of
    • \b - word boundary (redundant)
    • [a-zA-Z]+ - 1 or more letters
    • \b\s* - a word boundary followed by 0 or more whitespace characters
  • (?=[:-]) - a lookahead requiring a : or - to be right after the preceding subpattern

The main points here are adding [a-z] to the [A-Z] character class, removing \s*$, and turning the trailing (?:[:-]) non-capturing group into a lookahead (which does not consume characters).
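The effect of each change can be checked side by side. This is a minimal sketch in Python (an assumed language; the variable names are hypothetical) that keeps the word part of the pattern fixed and varies only the ending:

```python
import re

# Shared word part of the pattern, with lowercase letters allowed.
WORDS = r'^\s*(?:\b[a-zA-Z]+\b\s*)+'

with_end_anchor = re.compile(WORDS + r'(?:[:-])\s*$')  # marker must end the line
consuming = re.compile(WORDS + r'(?:[:-])')            # marker is part of the match
lookahead = re.compile(WORDS + r'(?=[:-])')            # marker is checked, not consumed

line = "I Am A Tab- Lorem Ipsum is simply dummy text"

print(with_end_anchor.match(line))      # None
print(consuming.match(line).group())    # I Am A Tab-
print(lookahead.match(line).group())    # I Am A Tab
```

With `\s*$` in place, a header followed by body text on the same line does not match at all; with the plain group, the marker ends up inside the match; only the lookahead version returns the words alone.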

Here's @stribizhev's solution, which worked like a charm.

^\s*(?:\b[a-zA-Z]+\b\s*)+(?=[:-])

For newbies like me, this is a simple explanation of the solution:

> ^        Anchor at the start of the string (or of a line in multiline mode)
> \s*      Match any leading whitespace (tab, newline, space), if present
> (?:      Start a non-capturing group
> \b       Assert a word boundary
> [a-zA-Z] Match one letter, uppercase or lowercase
> +        Repeat the letter one or more times to form a word
> \b       Assert a word boundary at the end of the word
> \s*      Match any whitespace after the word
> )+       Repeat the group for one or more words
> (?=      Start a lookahead: check what follows without consuming it
> [:-]     Look for ':' or '-'
> )        End the lookahead

In essence, this regular expression looks for a group of words at the start of a line that is followed by ':' or '-'.
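The same explanation can live next to the pattern itself. The sketch below assumes Python and uses its `re.VERBOSE` flag, which ignores whitespace in the pattern and allows `#` comments; the name `HEADER` is hypothetical.

```python
import re

HEADER = re.compile(
    r"""
    ^              # start of the string
    \s*            # optional leading whitespace
    (?:            # non-capturing group for one word
        \b         #   word boundary
        [a-zA-Z]+  #   one or more ASCII letters
        \b         #   word boundary
        \s*        #   optional whitespace after the word
    )+             # one or more words
    (?=[:-])       # lookahead: next character is ':' or '-', not consumed
    """,
    re.VERBOSE,
)

m = HEADER.match("\tI Am A Header: body text")
# The leading \s* is part of the match, so strip it when only the words matter.
print(repr(m.group().strip()))
```

Because the leading `\s*` belongs to the match, the raw result keeps any indentation; `strip()` is one simple way to drop it.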

Make the above expression smarter by adding {n,m} to tell it how many words a header may contain before you start losing relevant information:

^[\s]*(?:\b[a-zA-Z]+\b[\s]*){1,3}(?=[:-])
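A quick check of the bounded version, again sketched in Python as an assumed language with the hypothetical name `BOUNDED`. One thing worth noticing is that the limit rejects longer headers outright instead of truncating them, so the four-word samples from the question need an upper bound of at least 4.

```python
import re

# The bounded pattern from the post: between 1 and 3 words before the marker.
BOUNDED = re.compile(r'^[\s]*(?:\b[a-zA-Z]+\b[\s]*){1,3}(?=[:-])')

for line in ["Short Header: text", "One Two Three- text", "I AM A HEADER:"]:
    m = BOUNDED.match(line)
    print(repr(line), '->', m.group() if m else None)
# 'Short Header: text' -> Short Header
# 'One Two Three- text' -> One Two Three
# 'I AM A HEADER:' -> None
```

The last line fails because the pattern is anchored at the start: after three words the next character is a letter, not ':' or '-', and there is no shorter prefix that is followed by a marker.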


