
Regex complete words pattern

I want to match patterns made of complete words, not pieces of words. E.g. 12345 [some word] 1234567 [some word] 123 1679. Random text follows, and then the pattern appears again: 1111 123 [word] 555.

This should return:

[[12345, 1234567, 123, 1679],[1111, 123, 555]]

I tolerate only one word between the numbers; otherwise the whole string would match. Note that it is important to capture that two matches were found, which is why a two-element list is returned.

I am running this in Python 3. I have tried:

\b(\d+)\b\s\b(\w+)?\b\s\b(\d+)\b

but I am not sure how to scale this to an unrestricted number of matches.

re.findall(r'\b(\d+)\b\s\b(\w+)?\b\s\b(\d+)\b', string)

This matches [number] [word] [number], but not any further numbers that might follow, with or without a word in between.

You can't do this in one operation with the Python re engine.
But you could match the whole sequence with one regex, then extract the numbers from each match with another.

This matches the sequence:

r"(?<!\w)\d+(?:(?:[^\S\r\n]+[a-zA-Z](?:\w*[a-zA-Z])*)?[^\S\r\n]+\d+)*(?!\w)"

https://regex101.com/r/73AYLU/1

Explained

 (?<! \w )                     # Not a word behind
 \d+                           # Many digits
 (?:                           # Optional word block
      (?:                           # Optional words
           [^\S\r\n]+                    # Horizontal whitespace
           [a-zA-Z]                      # Starts with a letter
           (?: \w* [a-zA-Z] )*           # Can be digits in middle, ends with a letter
      )?                            # End words, do once
      [^\S\r\n]+                    # Horizontal whitespace
      \d+                           # Many digits
 )*                            # End word block, do many times
 (?! \w )                      # Not a word ahead
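
As a minimal sketch, the commented layout above can be compiled more or less as written by passing re.VERBOSE, which makes the regex engine ignore whitespace and `#` comments inside the pattern. The name SEQUENCE and the sample strings are illustrative only and do not come from the original answer.

```python
import re

# The sequence pattern from the answer, kept in its commented layout.
# SEQUENCE is a name chosen for this sketch.
SEQUENCE = re.compile(
    r"""
    (?<!\w)                     # not preceded by a word character
    \d+                         # first number
    (?:                         # repeated block: optional word, then a number
        (?:
            [^\S\r\n]+              # horizontal whitespace
            [a-zA-Z]                # the word starts with a letter
            (?:\w*[a-zA-Z])*        # and, if longer, ends with a letter
        )?                          # at most one word
        [^\S\r\n]+              # horizontal whitespace
        \d+                     # next number
    )*
    (?!\w)                      # not followed by a word character
    """,
    re.VERBOSE,
)

# One word between numbers is tolerated, two words end the sequence.
one_word = SEQUENCE.search("12 ab 34").group()
two_words = SEQUENCE.search("12 ab cd 34").group()
print(one_word)   # 12 ab 34
print(two_words)  # 12
```

With two words in between, the first match stops at `12`, and `34` would be found as a separate match.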

This gets the array of numbers from each sequence matched above (use findall):

r"(?<!\S)(\d+)(?!\S)"

https://regex101.com/r/BHov38/1

Explained

 (?<! \S )              # Whitespace boundary
 ( \d+ )                # (1)
 (?! \S )               # Whitespace boundary
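
The two steps described in this answer could be combined roughly as follows. This is a sketch rather than the answerer's own code: the helper name number_groups and the sample string (with plain words standing in for the question's [some word] placeholders) are assumptions made for illustration.

```python
import re

# Step 1: match a whole run of numbers with at most one word between them.
SEQUENCE = re.compile(
    r"(?<!\w)\d+(?:(?:[^\S\r\n]+[a-zA-Z](?:\w*[a-zA-Z])*)?[^\S\r\n]+\d+)*(?!\w)"
)
# Step 2: pull the whitespace-delimited numbers out of one matched run.
NUMBER = re.compile(r"(?<!\S)(\d+)(?!\S)")


def number_groups(text):
    """Return one list of ints per matched sequence (hypothetical helper)."""
    groups = []
    for match in SEQUENCE.finditer(text):
        groups.append([int(n) for n in NUMBER.findall(match.group())])
    return groups


sample = (
    "12345 some 1234567 word 123 1679 "
    "Random text and the pattern appears again 1111 123 word 555"
)
print(number_groups(sample))
# [[12345, 1234567, 123, 1679], [1111, 123, 555]]
```

Note that the sequence pattern also matches a single isolated number, so a lone number in the text would show up as a one-element list.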

Are you expecting re.findall() to return a list of lists? It returns a single flat list of matches (strings, or tuples when the pattern has more than one group), no matter what regex you use.
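
To illustrate, here is a small sketch that runs the pattern from the question on a made-up sample string. Because the pattern has three capture groups, findall returns a list of tuples, and because matches cannot overlap, the middle number is consumed by the first match and the trailing number is not reported.

```python
import re

# Pattern from the question, written as a raw string.
pattern = r'\b(\d+)\b\s\b(\w+)?\b\s\b(\d+)\b'

# Sample input invented for this sketch.
matches = re.findall(pattern, "12345 foo 1234567 bar 123")
print(matches)  # [('12345', 'foo', '1234567')]
```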

One approach is to split your input string into sentences and then loop through them:

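import re
# '<pattern>' is a placeholder for whatever separates the number sequences
inputArray = re.split('<pattern>', inputText)
outputArray = []
for item in inputArray:
    # raw string, so that \b is a word boundary and not a backspace character
    outputArray.append(re.findall(r'\b(\d+)\b\s\b(\w+)?\b\s\b(\d+)\b', item))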

The trick is to find a <pattern> that splits your input.
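
One possible way to fill in the split step, offered only as a sketch: it assumes that sequences are separated by runs of two or more purely alphabetic words, which the original answer does not state. The SEPARATOR pattern and the sample text are hypothetical, and a simpler per-chunk pattern is used to collect the numbers.

```python
import re

# Assumption for this sketch: two or more consecutive alphabetic words
# mark the boundary between number sequences.
SEPARATOR = r"(?:\s+[A-Za-z]+){2,}\s+"

text = (
    "12345 some 1234567 word 123 1679 "
    "Random text and the pattern appears again 1111 123 word 555"
)

chunks = re.split(SEPARATOR, text)
# Collect every standalone number in each chunk.
result = [[int(n) for n in re.findall(r"\b\d+\b", chunk)] for chunk in chunks]
print(result)
# [[12345, 1234567, 123, 1679], [1111, 123, 555]]
```

This breaks down if the text starts or ends with a run of words, or if separators contain digits, so the split pattern would need adjusting to the real input.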

This is a bit complicated; maybe this expression is something to look into:

(((\d+)\s*)*(?:\s*\[.*?\]\s*)((\d+)\s*)*)|([A-Za-z\s]+)

and then script the rest of the problem to get a valid solution.

