String mask and offset with regex

I have a string for which I'm trying to create a regex mask that shows N words, starting at a given offset. Let's say I have the following string:

"The quick, brown fox jumps over the lazy dog."

I want to show 3 words at a time:

offset 0: "The quick, brown"
offset 1: "quick, brown fox"
offset 2: "brown fox jumps"
offset 3: "fox jumps over"
offset 4: "jumps over the"
offset 5: "over the lazy"
offset 6: "the lazy dog."

I'm using Python and I've been using the following simple regex to detect 3 words:

>>> import re
>>> s = "The quick, brown fox jumps over the lazy dog."
>>> re.search(r'(\w+\W*){3}', s).group()
'The quick, brown '

But I can't figure out how to build a kind of mask that shows the next 3 words rather than the first ones. I need to keep the punctuation.

The prefix-matching option

You can make this work with a variable-prefix regex that skips the first offset words and then captures the word triplet into a group.

Something like this:

import re
s = "The quick, brown fox jumps over the lazy dog."

# Each result keeps the separator after its last word (here, a trailing space).
print(re.search(r'(?:\w+\W*){0}((?:\w+\W*){3})', s).group(1))
# The quick, brown
print(re.search(r'(?:\w+\W*){1}((?:\w+\W*){3})', s).group(1))
# quick, brown fox
print(re.search(r'(?:\w+\W*){2}((?:\w+\W*){3})', s).group(1))
# brown fox jumps

Let's take a look at the pattern:

 _"word"_      _"word"_
/        \    /        \
(?:\w+\W*){2}((?:\w+\W*){3})
             \_____________/
                group 1

This does what it says: match 2 words without capturing them, then match 3 words and capture them into group 1.

The (?:...) constructs group the "word" pattern so that it can be repeated, but they are non-capturing.
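The pattern above can be built for any offset instead of being typed out by hand. The following is a minimal Python 3 sketch, not part of the original answer; the helper name window and its size parameter are assumptions made for illustration.

```python
import re

# Same loose "word" pattern as above: word characters plus any trailing non-word characters.
WORD = r"\w+\W*"

def window(text, offset, size=3):
    """Return `size` words starting at word index `offset` (hypothetical helper)."""
    # Skip `offset` words without capturing, then capture `size` words into group 1.
    pattern = r"(?:%s){%d}((?:%s){%d})" % (WORD, offset, WORD, size)
    match = re.match(pattern, text)  # the skipped prefix must start at the beginning
    return match.group(1) if match else None

s = "The quick, brown fox jumps over the lazy dog."
for offset in range(7):
    print(offset, repr(window(s, offset)))
```

Because \W* can match the empty string, an offset past the last word may still produce a match by splitting a word, so this sketch does not validate the offset.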

Note on "word" pattern

It should be said that \w+\W* is a poor choice for a "word" pattern, as the following example shows:

import re
s = "nothing"
print(re.search(r'(\w+\W*){3}', s).group())
# nothing

The string does not contain 3 words, but the regex matches anyway: \W* allows an empty match, so \w+ can backtrack and split "nothing" into three pieces.

Perhaps a better pattern is something like:

\w+(?:\W+|$)

That is, a \w+ that is followed by either a \W+ or the end of the string ($).
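A quick check of the stricter pattern, as a sketch in Python 3: the helper name first_n_words is an assumption, not something from the answer. With this pattern a single word can no longer be split into several "words".

```python
import re

# Stricter "word": word characters followed by at least one non-word character or the end.
STRICT_WORD = r"\w+(?:\W+|$)"

def first_n_words(text, n):
    """Return the first run of n consecutive words, or None (hypothetical helper)."""
    match = re.search(r"(?:%s){%d}" % (STRICT_WORD, n), text)
    return match.group() if match else None

print(first_n_words("nothing", 3))               # no 3 words, so no match
print(first_n_words("The quick, brown fox", 3))
```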


The capturing lookahead option

As suggested by Kobi in a comment, this option is simpler in that you only need one static pattern. It uses findall to capture all matches (see on ideone.com):

import re
s = "The quick, brown fox jumps over the lazy dog."

triplets = re.findall(r"\b(?=((?:\w+(?:\W+|$)){3}))", s)

print(triplets)
# ['The quick, brown ', 'quick, brown fox ', 'brown fox jumps ',
#  'fox jumps over ', 'jumps over the ', 'over the lazy ', 'the lazy dog.']

print(triplets[3])
# fox jumps over

This works by matching at each zero-width word boundary \b and using a lookahead to capture 3 "words" into group 1. Since the lookahead consumes no characters, the captured triplets are allowed to overlap.

    ______lookahead______
   /      ___"word"__    \
  /      /           \    \
\b(?=((?:\w+(?:\W+|$)){3}))
     \___________________/
           group 1
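The same lookahead idea extends to any window size. Here is a small Python 3 sketch under that assumption; the function name windows and the size parameter are illustrative and do not come from the answer.

```python
import re

def windows(text, size=3):
    """Return every run of `size` consecutive words, overlapping (hypothetical helper)."""
    # The lookahead consumes nothing, so the next match can start at the next word boundary.
    pattern = r"\b(?=((?:\w+(?:\W+|$)){%d}))" % size
    return re.findall(pattern, text)

s = "The quick, brown fox jumps over the lazy dog."
triplets = windows(s)
print(triplets[3])          # the window at offset 3
print(len(windows(s, 2)))   # number of two-word windows
```

Indexing the returned list by offset gives the requested window, and a text with fewer than size words yields an empty list.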

Another approach would be to split the string and select slices:

import re

# s is the sentence from the question
words = re.split(r"\s+", s)
for i in range(len(words) - 2):
    print(' '.join(words[i:i+3]))

This does, of course, assume that you either have only single spaces between words, or don't care that every whitespace sequence is folded into a single space.
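If the original whitespace does matter, one possible variation (a sketch, not part of the answer above) is to record where each word starts and ends and slice the original string. The function name windows_preserving_whitespace is hypothetical.

```python
import re

def windows_preserving_whitespace(text, size=3):
    """Slice `text` itself so that the whitespace inside each window is kept as written."""
    # (start, end) positions of every whitespace-delimited word
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    return [
        text[spans[i][0]:spans[i + size - 1][1]]
        for i in range(len(spans) - size + 1)
    ]

print(windows_preserving_whitespace("The  quick,\tbrown fox"))
```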

No need for regex

>>> s = "The quick, brown fox jumps over the lazy dog."
>>> for offset in range(7):
...     print('offset {0}: "{1}"'.format(offset, ' '.join(s.split()[offset:][:3])))
... 
offset 0: "The quick, brown"
offset 1: "quick, brown fox"
offset 2: "brown fox jumps"
offset 3: "fox jumps over"
offset 4: "jumps over the"
offset 5: "over the lazy"
offset 6: "the lazy dog."

We have two orthogonal issues here:

  1. How to split the string.
  2. How to build groups of 3 consecutive elements.

For 1 you could use regular expressions, but as others have pointed out, a simple str.split should suffice. For 2, note that what you want looks very similar to the pairwise abstraction in the itertools recipes:

http://docs.python.org/library/itertools.html#recipes

So we write our generalized n-wise function:

import itertools

def nwise(iterable, n):
    """nwise(iter([1,2,3,4,5]), 3) -> (1,2,3), (2,3,4), (3,4,5)"""
    # n independent iterators; the idx-th one is advanced by idx items
    iterables = itertools.tee(iterable, n)
    slices = (itertools.islice(it, idx, None) for (idx, it) in enumerate(iterables))
    return zip(*slices)  # itertools.izip in Python 2

And we end up with simple, modular code:

>>> s = "The quick, brown fox jumps over the lazy dog."
>>> list(nwise(s.split(), 3))
[('The', 'quick,', 'brown'), ('quick,', 'brown', 'fox'), ('brown', 'fox', 'jumps'), ('fox', 'jumps', 'over'), ('jumps', 'over', 'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog.')]

Or, joined into strings as you requested:

>>> # also: list(map(" ".join, nwise(s.split(), 3)))
>>> [" ".join(words) for words in nwise(s.split(), 3)]
['The quick, brown', 'quick, brown fox', 'brown fox jumps', 'fox jumps over', 'jumps over the', 'over the lazy', 'the lazy dog.']
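For comparison, the same sliding window can also be sketched with a bounded collections.deque. This is an alternative technique, not the tee-based approach of the answer above, and the name sliding_window is an assumption.

```python
from collections import deque

def sliding_window(iterable, n):
    """Yield successive overlapping tuples of length n from iterable."""
    window = deque(maxlen=n)  # the oldest item drops out automatically once full
    for item in iterable:
        window.append(item)
        if len(window) == n:
            yield tuple(window)

s = "The quick, brown fox jumps over the lazy dog."
phrases = [" ".join(words) for words in sliding_window(s.split(), 3)]
print(phrases[3])
```

The deque version reads the input once and keeps only n items in memory, which may be preferable for long inputs; the results should be the same as with nwise.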
