String mask and offset with regex

Question

I have a string, and I'm trying to build a regex mask that shows N words starting at a given offset. Let's say I have the following string:

"The quick, brown fox jumps over the lazy dog."

I want to show 3 words at a time:

offset 0 : "The quick, brown"
offset 1 : "quick, brown fox"
offset 2 : "brown fox jumps"
offset 3 : "fox jumps over"
offset 4 : "jumps over the"
offset 5 : "over the lazy"
offset 6 : "the lazy dog."

I'm using Python and I've been using the following simple regex to detect 3 words:

>>> import re
>>> s = "The quick, brown fox jumps over the lazy dog."
>>> re.search(r'(\w+\W*){3}', s).group()
'The quick, brown '

But I can't figure out how to build a kind of mask that shows the next 3 words rather than the first ones. I need to keep punctuation.

Answer 1

The prefix-matching option

You can make this work by using a variable-prefix regex to skip the first offset words, and capturing the word triplet into a group.

So something like this:

import re
s = "The quick, brown fox jumps over the lazy dog."

# The {k} after the non-capturing prefix is the offset: k words are skipped.
print(re.search(r'(?:\w+\W*){0}((?:\w+\W*){3})', s).group(1))
# The quick, brown
print(re.search(r'(?:\w+\W*){1}((?:\w+\W*){3})', s).group(1))
# quick, brown fox
print(re.search(r'(?:\w+\W*){2}((?:\w+\W*){3})', s).group(1))
# brown fox jumps

Let's take a look at the pattern:

 _"word"_      _"word"_
/        \    /        \
(?:\w+\W*){2}((?:\w+\W*){3})
             \_____________/
                group 1

This does what it says: match 2 words, then, capturing into group 1, match 3 words.

The (?:...) constructs group the subpattern so it can be repeated, but they're non-capturing.
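
The offset and the window size can also be interpolated into the pattern instead of being typed by hand. The following is a minimal sketch, written for Python 3; the helper name words_at and its parameters are made up for illustration and are not part of the original answer. It keeps the same \w+\W* word pattern, so the result still carries any trailing non-word characters.

```python
import re

def words_at(s, offset, n=3):
    """Return n "words" of s after skipping `offset` words, or None if no match."""
    # Doubled braces are literal braces in str.format, so offset=2, n=3
    # produces the pattern (?:\w+\W*){2}((?:\w+\W*){3}).
    pattern = r'(?:\w+\W*){{{0}}}((?:\w+\W*){{{1}}})'.format(offset, n)
    match = re.search(pattern, s)
    return match.group(1) if match else None

s = "The quick, brown fox jumps over the lazy dog."
print(repr(words_at(s, 0)))  # 'The quick, brown '
print(repr(words_at(s, 2)))  # 'brown fox jumps '
```

Calling .strip() on the result would drop the trailing space if it is unwanted.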

References

regular-expressions.info/Capturing Groups, Non-capturing Groups
- Repeating a Capturing Group vs Capturing a Repeated Group

Note on "word" pattern

It should be said that \w+\W* is a poor choice for a "word" pattern, as the following example shows:

import re
s = "nothing"
print(re.search(r'(\w+\W*){3}', s).group())
# nothing

There aren't 3 words, but the regex matches anyway: \W* can match the empty string, so backtracking splits the single word into three pieces.

Perhaps a better pattern is something like:

\w+(?:\W+|$)

That is, a \w+ that is followed by either a \W+ or the end of the string, $.
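
As a quick check of the stricter pattern, here is a small Python 3 sketch; the names WORD and three_words are chosen only for this illustration.

```python
import re

# Stricter "word": word characters followed by non-word characters
# or by the end of the string.
WORD = r'\w+(?:\W+|$)'
three_words = re.compile(r'(?:' + WORD + r'){3}')

print(three_words.search("nothing"))                # None: only one word
print(three_words.search("one two three").group())  # one two three
```

Because every word must now be followed by at least one non-word character or the end of the string, the engine can no longer split a single word into several "words".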

The capturing lookahead option

As suggested by Kobi in a comment, this option is simpler in that you only have one static pattern. It uses findall to capture all matches (see on ideone.com):

import re
s = "The quick, brown fox jumps over the lazy dog."

triplets = re.findall(r"\b(?=((?:\w+(?:\W+|$)){3}))", s)

print(triplets)
# ['The quick, brown ', 'quick, brown fox ', 'brown fox jumps ',
#  'fox jumps over ', 'jumps over the ', 'over the lazy ', 'the lazy dog.']

# The list index is the offset; each entry keeps its trailing space.
print(triplets[3])
# fox jumps over

This works by matching at the zero-width word boundary \b and using a lookahead to capture 3 "words" in group 1. Because the match itself consumes nothing, the next search starts at the following word boundary, which is what makes the triplets overlap.

    ______lookahead______
   /      ___"word"__    \
  /      /           \    \
\b(?=((?:\w+(?:\W+|$)){3}))
     \___________________/
           group 1
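
To see why the lookahead matters, the sketch below (Python 3, with variable names chosen for illustration) runs the same three-word pattern once as a consuming match and once inside a lookahead.

```python
import re

s = "The quick, brown fox jumps over the lazy dog."
word = r'\w+(?:\W+|$)'

# Consuming pattern: each match eats its 3 words, so the windows
# cannot overlap.
consumed = re.findall(r'(?:' + word + r'){3}', s)

# Lookahead pattern: the match itself is empty, so the scan resumes at
# the next word boundary and the captured windows overlap.
overlapping = re.findall(r'\b(?=((?:' + word + r'){3}))', s)

print(consumed)
# ['The quick, brown ', 'fox jumps over ', 'the lazy dog.']
print(len(overlapping))
# 7
```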

References

regular-expressions.info/Lookarounds

Answer 2

One approach would be to split the string and select slices:

import re

s = "The quick, brown fox jumps over the lazy dog."
words = re.split(r"\s+", s)
for i in range(len(words) - 2):
    print(' '.join(words[i:i+3]))

This does, of course, assume that you either have only single spaces between words, or don't care if all whitespace sequences are folded into single spaces.
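
If the original spacing does matter, one possible variation is to record where each word starts and ends and slice the original string instead of re-joining the words. This is only a sketch for Python 3, not something the answer proposes; the function name window is hypothetical.

```python
import re

def window(s, offset, n=3):
    """Return n whitespace-delimited words starting at word `offset`, spacing intact."""
    # (start, end) positions of every run of non-whitespace characters.
    spans = [m.span() for m in re.finditer(r'\S+', s)]
    chunk = spans[offset:offset + n]
    if len(chunk) < n:
        return None  # not enough words left at this offset
    # Slice from the start of the first word to the end of the last one.
    return s[chunk[0][0]:chunk[-1][1]]

s = "The  quick,\tbrown fox"
print(repr(window(s, 0)))  # 'The  quick,\tbrown'
print(repr(window(s, 1)))  # 'quick,\tbrown fox'
print(window(s, 2))        # None
```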

Answer 3

No need for regex

>>> s = "The quick, brown fox jumps over the lazy dog."
>>> for offset in range(7):
...     print('offset {0}: "{1}"'.format(offset, ' '.join(s.split()[offset:][:3])))
... 
offset 0: "The quick, brown"
offset 1: "quick, brown fox"
offset 2: "brown fox jumps"
offset 3: "fox jumps over"
offset 4: "jumps over the"
offset 5: "over the lazy"
offset 6: "the lazy dog."

Answer 4

We have two orthogonal issues here:

1. How to split the string.
2. How to build groups of 3 consecutive elements.

For 1, you could use regular expressions or, as others have pointed out, a simple str.split should suffice. For 2, note that what you want looks very similar to the pairwise abstraction in the itertools recipes:

http://docs.python.org/library/itertools.html#recipes

So we write our generalized n-wise function:

import itertools

def nwise(iterable, n):
    """nwise(iter([1,2,3,4,5]), 3) -> (1,2,3), (2,3,4), (3,4,5)"""
    # n independent copies of the iterable; copy k is advanced by k items.
    iterables = itertools.tee(iterable, n)
    slices = (itertools.islice(it, idx, None) for (idx, it) in enumerate(iterables))
    # zip is lazy in Python 3 (itertools.izip in Python 2).
    return zip(*slices)

And we end up with simple, modular code:

>>> s = "The quick, brown fox jumps over the lazy dog."
>>> list(nwise(s.split(), 3))
[('The', 'quick,', 'brown'), ('quick,', 'brown', 'fox'), ('brown', 'fox', 'jumps'), ('fox', 'jumps', 'over'), ('jumps', 'over', 'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog.')]

Or as you requested:

>>> # also: list(map(" ".join, nwise(s.split(), 3)))
>>> [" ".join(words) for words in nwise(s.split(), 3)]
['The quick, brown', 'quick, brown fox', 'brown fox jumps', 'fox jumps over', 'jumps over the', 'over the lazy', 'the lazy dog.']
