Regex to match uppercase words, plus the surrounding ±4 words

Question

I have a bunch of documents and I'm interested in finding mentions of clinical trials. These are always written in all caps (e.g. ASPIRE). I want to match any all-caps word longer than three letters. I also want the surrounding ±4 words for context.

Below is what I currently have. It kind of works, but fails the test below.

import re
pattern = '((?:\w*\s*){,4})\s*([A-Z]{4,})\s*((?:\s*\w*){,4})'
line = r"Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY."
re.findall(pattern, line)

Answer 1

Would the following regex work for you?

(\b\w+\b\W*){,4}[A-Z]{3,}\W*(\b\w+\b\W*){,4}

Tested here: https://regex101.com/r/nTzLue/1/
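For anyone who wants to try this from Python rather than regex101, here is a minimal sketch that runs the pattern above over the sample line from the question. The use of `re.finditer` and the variable names are my own choices, not part of the answer. As far as I can tell, matches cannot overlap, so a capitalised word that is consumed as context for another one (IPSUM here) does not get its own window.

```python
import re

# Pattern copied from the answer above; {,4} means "zero to four repetitions".
pattern = re.compile(r'(\b\w+\b\W*){,4}[A-Z]{3,}\W*(\b\w+\b\W*){,4}')
line = "Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY."

# group(0) is the whole matched window: leading context, capitalised word,
# trailing context. strip() drops the whitespace swallowed by the final \W*.
windows = [m.group(0).strip() for m in pattern.finditer(line)]

for window in windows:
    print(window)
```

Note that `[A-Z]{3,}` accepts three-letter words, whereas the question asks for more than three letters; `{4,}` would be the stricter choice.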

Answer 2

On the left side you could match one or more word characters \w+ followed by one or more non-word characters \W+. Combine those two in a non-capturing group and repeat it 4 times with {4}, like (?:\w+\W+){4}

Then capture 3 or more uppercase characters in a group: ([A-Z]{3,}).

On the right side you could reverse the order used on the left, matching non-word characters followed by word characters: (?:\W+\w+){4}

(?:\w+\W+){4}([A-Z]{3,})(?:\W+\w+){4}

The captured group will contain your uppercase word, while the non-capturing groups match the surrounding words, which are part of the overall match.
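A small sketch of how this pattern behaves on the sample line from the question may help. The variable names are mine. Because both quantifiers are exactly `{4}`, a capitalised word only matches when it has four full words on each side, so on this input only DUMMY qualifies.

```python
import re

# Pattern from this answer: exactly four words of context on each side.
pattern = re.compile(r'(?:\w+\W+){4}([A-Z]{3,})(?:\W+\w+){4}')
line = "Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY."

# With a single capturing group, findall returns just the captured words.
hits = pattern.findall(line)
print(hits)

# The surrounding words are available through the full match, group(0).
first = pattern.search(line)
print(first.group(0))
```

Relaxing both quantifiers to `{0,4}` would allow capitalised words near the start or end of the text to match with shorter context.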

Answer 3

You may use this code in Python, which does it in two steps. First we split the input on capital words of 4 or more letters, then we find up to 4 words on either side of each match.

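import re

text = 'Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY'

re1 = r'\b([A-Z]{4,})\b'     # a word of four or more capital letters
re2 = r'(?:\s*\w+\b){,4}'    # up to four words from the start of a segment

# Splitting on a capturing group keeps the capitalised words at odd indices.
arr = re.split(re1, text)

result = []

for i in range(len(arr)):
    if i % 2:
        result.append( (re.search(re2, arr[i-1]).group(), arr[i], re.search(re2, arr[i+1]).group()) )


print(result)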

Output:

[('Lorem', 'IPSUM', ' is simply'), (' is simply', 'DUMMY', ' text of the printing'), (' text of the printing', 'INDUSTRY', '')]
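In the output above, the left-hand context for INDUSTRY is the first four words of the preceding segment rather than the four words nearest to it. If the nearest words are what you want, one possible variant is sketched below. The function name `caps_with_context` and the `width` parameter are hypothetical, and this is a reworking of the same split-based idea rather than the answer's original code. Context still stops at the neighbouring capitalised word, because the text is split there.

```python
import re

CAPS = re.compile(r'\b([A-Z]{4,})\b')   # words of four or more capital letters
WORD = re.compile(r'\w+')

def caps_with_context(text, width=4):
    """Return (before, WORD, after) tuples with up to `width` words per side.

    Assumes width is a positive integer.
    """
    # Splitting on a capturing group keeps the capitalised words at odd indices.
    parts = CAPS.split(text)
    results = []
    for i in range(1, len(parts), 2):
        before = WORD.findall(parts[i - 1])[-width:]  # words nearest on the left
        after = WORD.findall(parts[i + 1])[:width]    # words nearest on the right
        results.append((' '.join(before), parts[i], ' '.join(after)))
    return results

text = 'Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY'
for row in caps_with_context(text):
    print(row)
```

Joining the words with single spaces discards the original punctuation and spacing, which seemed acceptable for a context snippet.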

Answer 4

This should do the job:

pattern = '(?:(\w+ ){4})[A-Z]{3}(\w+ ){5}'
