Overlapping matches with regex for ngrams

I have a string and need to use regex.

"hello COMMA the matche's roll over matche's or the expression for details PCRE flavors of regex are supported here"

and I want to find its bigrams and trigrams. Focusing on bigrams, it should pull

hello COMMA
COMMA the
the matche's
etc

I've written this regex to do that, but it's not grabbing the overlapping results.

[\w'-]+ [\w'-]+

It will only grab

hello COMMA
the matche's
etc

When I wrap it in ?= like this, it now grabs all sorts of trash. What am I missing?

(?=([\w'-]+ [\w'-]+))

Also, the overlap=True thing doesn't work for me for some reason.

Do not use regular expressions for text processing. The NLTK package was specifically designed for that job:

import nltk
text = "hello COMMA the matche's roll over ..."
words = nltk.word_tokenize(text)
list(nltk.bigrams(words))
# [('hello', 'COMMA'), ('COMMA', 'the'), ('the', 'matche'),...]
list(nltk.trigrams(words))
#[('hello', 'COMMA', 'the'), ('COMMA', 'the', 'matche'), ...]
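
Note that the tokenizer output above splits "matche's" at the apostrophe. If the apostrophe should stay inside the token, as in the question's expected output, a dependency-free sketch along the following lines may be enough. It assumes tokens are separated by whitespace only; the helper name ngrams is made up for this illustration and is not part of NLTK.

```python
def ngrams(tokens, n):
    """Return all overlapping n-grams of a token list as tuples (hypothetical helper)."""
    # zip over n staggered views of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

text = "hello COMMA the matche's roll over"
words = text.split()  # assumption: whitespace split, so "matche's" stays one token

bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
print(bigrams[:3])
print(trigrams[:2])
```

Because zip stops at the shortest slice, no partial n-grams are produced at the end of the list.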

Would you please try the following:

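import re

# named "text" rather than "str" to avoid shadowing the built-in str type
text = "hello COMMA the matche's roll over matche's or the expression for details PCRE flavors of regex are supported here"

# consume one token and the whitespace after it, then capture the next token
# inside a lookahead so it is not consumed and can begin the next match
matches = re.finditer(r'\S+\s(?=(\S+))', text)
for match in matches:
    print(match.group(0) + match.group(1))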

Output:

hello COMMA
COMMA the
the matche's
matche's roll
[snipped]

The regex (?=(\S+)) includes a capture group within the positive lookahead assertion. Because the lookahead is zero-width, the matched substring is assigned to match.group(1) without moving the position forward, so the same word is still available to begin the next match.
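
The same zero-width behaviour likely explains the "trash" in the question: a pattern that consists only of a lookahead can match at every character position, including the middle of a word. The sketch below illustrates this on a shortened sample string and shows one possible fix, a negative lookbehind that only lets a match start at the beginning of a token. The lookbehind is an addition for illustration, not something taken from the answers here.

```python
import re

text = "hello COMMA the matche's roll"

# Lookahead alone: a zero-width match is attempted at every position,
# so "bigrams" also start in the middle of a word.
unanchored = re.findall(r"(?=([\w'-]+ [\w'-]+))", text)

# Same pattern, but a match may only start where the previous character
# is not a token character, i.e. at the start of a token.
anchored = re.findall(r"(?<![\w'-])(?=([\w'-]+ [\w'-]+))", text)

print(unanchored[:3])
print(anchored)
```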

The regular expression below is a generalisation and simplification of the regex suggested in a comment on the question by @Wiktor. Wiktor's solution was for 2-grams (bigrams); this one is for 3-grams (trigrams). For n-grams, where n is a variable, replace the quantifier {2} with the value of n-1 in braces (written {#{n-1}} in Ruby-style string interpolation).

First assume that the string contains only word characters and whitespace. The following regex can then be used to extract the trigrams:

(?=(?<!\S)(\w+(?:\s+\w+){2}))

The regex can be broken down as follows:

(?=           # begin a positive lookahead   
  (?<!        # begin a negative lookbehind
    \S        # match a non-whitespace char
  )           # end the negative lookbehind
  (           # begin capture group 1
    \w+       # match 1+ word chars
    (?:       # begin a non-capture group
      \s+\w+  # match 1+ whitespace chars followed by 1+ word chars
    )         # end non-capture group
    {2}       # execute the non-capture group exactly 2 times
  )           # end capture group
)             # end positive lookahead
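
In Python, an annotated pattern like this can be compiled more or less as written by passing re.VERBOSE, which ignores layout whitespace and trailing comments. The following is a minimal sketch of the trigram pattern (with the {2} quantifier), run on a made-up sample string that contains only word characters and whitespace.

```python
import re

# The trigram pattern with its comments kept; re.VERBOSE skips the
# whitespace used for layout and everything after an unescaped "#".
trigram_re = re.compile(r"""
(?=                 # begin a positive lookahead
  (?<!\S)           # the previous char must not be a non-whitespace char
  (                 # begin capture group 1
    \w+             # match 1+ word chars
    (?:\s+\w+){2}   # twice: 1+ whitespace chars followed by 1+ word chars
  )                 # end capture group 1
)                   # end positive lookahead
""", re.VERBOSE)

text = "one two three four"  # assumed sample input
trigrams = trigram_re.findall(text)
print(trigrams)
```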

If, as in the example, the string may also contain apostrophes within words (but not at the beginning or end of a word), each token \w+ above can be replaced with \w+(?:[']\w+)* to obtain:

(?=(?<!\S)((?:\w+(?:[']\w+)*(?:\s+\w+(?:[']\w+)*){2})))
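
For a variable n, the pattern can be assembled programmatically. The sketch below does this in Python under the same assumptions as above (tokens are word characters with optional internal apostrophes, separated by whitespace). The names TOKEN and ngram_pattern are hypothetical, chosen for this illustration only.

```python
import re

# A word that may contain internal apostrophes; equivalent to the
# token \w+(?:[']\w+)* used above.
TOKEN = r"\w+(?:'\w+)*"

def ngram_pattern(n, token=TOKEN):
    """Build an overlapping n-gram regex (hypothetical helper, n >= 1)."""
    # {{{n - 1}}} renders as a literal quantifier such as {2} for n = 3
    return re.compile(rf"(?=(?<!\S)({token}(?:\s+{token}){{{n - 1}}}))")

text = "hello COMMA the matche's roll over"
bigrams = ngram_pattern(2).findall(text)
trigrams = ngram_pattern(3).findall(text)
print(bigrams)
print(trigrams)
```

Keeping the token as a parameter means a stricter or looser definition of a word can be swapped in without touching the lookahead structure.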

The regex quickly becomes unwieldy, however, if the rules for how many of certain characters may appear, and where, become more demanding.

This is an example of a situation where a regex should not be used, as the desired array can be produced much more easily with other tools. It is a useful exercise, however, as it sharpens one's facility with regular expressions.
