
Need C# Regex to get pairs of words in a sentence

Is there a regex that would take the following sentence:

"I want this split up into pairs"

and generate the following list:

"I want", "want this", "this split", "split up", "up into", "into pairs"

Since words need to be re-used, you need lookahead assertions:

Regex regexObj = new Regex(
    @"(     # Match and capture in backreference no. 1:
     \w+    # one or more alphanumeric characters
     \s+    # one or more whitespace characters.
    )       # End of capturing group 1.
    (?=     # Assert that there follows...
     (\w+)  # another word; capture that into backref 2.
    )       # End of lookahead.", 
    RegexOptions.IgnorePatternWhitespace);
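// subjectString is the input sentence; resultList collects the pairs.
List<string> resultList = new List<string>();
Match matchResult = regexObj.Match(subjectString);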
while (matchResult.Success) {
    resultList.Add(matchResult.Groups[1].Value + matchResult.Groups[2].Value);
    matchResult = matchResult.NextMatch();
}
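For reference, here is a minimal, self-contained sketch of the loop above wrapped in a helper. The names `WordPairs` and `GetPairs` are made up for this illustration; the pattern is the one from the answer, just written without the inline comments.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordPairs
{
    // Same pattern as above: consume one word plus its trailing whitespace,
    // then peek at the next word without consuming it.
    private static readonly Regex PairRegex = new Regex(@"(\w+\s+)(?=(\w+))");

    // Hypothetical helper name, not part of the original answer.
    public static List<string> GetPairs(string subjectString)
    {
        var resultList = new List<string>();
        for (Match m = PairRegex.Match(subjectString); m.Success; m = m.NextMatch())
        {
            // Group 1 is "word + whitespace", group 2 is the following word.
            resultList.Add(m.Groups[1].Value + m.Groups[2].Value);
        }
        return resultList;
    }

    public static void Main()
    {
        foreach (string pair in GetPairs("I want this split up into pairs"))
        {
            Console.WriteLine(pair);
        }
    }
}
```

Because the second word is only looked at and never consumed, the next match starts on it, which is what lets each word appear in two pairs. Note that group 1 keeps the original whitespace, so an input with several spaces between two words produces a pair with those spaces intact.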

For groups of three:

Regex regexObj = new Regex(
    @"(     # Match and capture in backreference no. 1:
     \w+    # one or more alphanumeric characters
     \s+    # one or more whitespace characters.
    )       # End of capturing group 1.
    (?=     # Assert that there follows...
     (      # and capture...
      \w+   # another word,
      \s+   # whitespace,
      \w+   # word.
     )      # End of capturing group 2.
    )       # End of lookahead.", 
    RegexOptions.IgnorePatternWhitespace);
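The same idea extends to groups of any size by repeating the "whitespace, word" part inside the lookahead. The sketch below builds that pattern for a given `n`; the class `WordGroups` and the method `GetGroups` are hypothetical names used only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordGroups
{
    // Hypothetical helper: returns every run of n consecutive words (n >= 2).
    public static List<string> GetGroups(string subjectString, int n)
    {
        if (n < 2)
        {
            throw new ArgumentOutOfRangeException(nameof(n), "n must be at least 2");
        }

        // Consume one word, then look ahead at the remaining n - 1 words.
        // For n = 2 the quantifier is {0}, which reduces to the pair pattern.
        string pattern = @"(\w+\s+)(?=(\w+(?:\s+\w+){" + (n - 2) + "}))";
        var regex = new Regex(pattern);

        var resultList = new List<string>();
        for (Match m = regex.Match(subjectString); m.Success; m = m.NextMatch())
        {
            resultList.Add(m.Groups[1].Value + m.Groups[2].Value);
        }
        return resultList;
    }

    public static void Main()
    {
        foreach (string group in GetGroups("I want this split up into pairs", 3))
        {
            Console.WriteLine(group);
        }
    }
}
```

For the example sentence and `n = 3` this yields five groups, from "I want this" through "up into pairs".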

etc.

You could do:

var myWords = myString.Split(' ');

var myPairs = myWords.Take(myWords.Length - 1)
    .Select((w, i) => w + " " + myWords[i + 1]);

You could just use string.Split() and combine the results:

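var words = myString.Split(new char[] { ' ' });
var pairs = new List<string>();

// Stop one short of the end so words[i + 1] is always valid.
for (int i = 0; i < words.Length - 1; i++)
{
    // Re-insert the space that Split() removed.
    pairs.Add(words[i] + " " + words[i + 1]);
}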
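Both Split-based answers assume exactly one space between words; with repeated spaces, Split(' ') returns empty entries that end up in the pairs. A possible variation, sketched here under the assumption that any run of spaces, tabs, or line breaks counts as one separator, drops the empty entries and pairs each word with its successor using `Zip`. The names `SplitPairs` and `GetPairs` are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitPairs
{
    // Assumed separator set for this sketch.
    private static readonly char[] Separators = { ' ', '\t', '\r', '\n' };

    // Hypothetical helper name.
    public static List<string> GetPairs(string myString)
    {
        // RemoveEmptyEntries discards the empty strings produced by
        // consecutive separators.
        string[] words = myString.Split(Separators, StringSplitOptions.RemoveEmptyEntries);

        // Zip stops at the shorter sequence, so the last word is never
        // paired with a missing successor.
        return words.Zip(words.Skip(1), (a, b) => a + " " + b).ToList();
    }

    public static void Main()
    {
        foreach (string pair in GetPairs("I want  this split up into pairs"))
        {
            Console.WriteLine(pair);
        }
    }
}
```

Unlike the regex version, this normalizes the whitespace inside each pair to a single space.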

To do it with regex alone, without post-processing, we can reuse Tim Pietzcker's answer but run two regexes in succession.

We can run the original pattern from Tim Pietzcker's answer, plus the same pattern with a lookbehind that makes the regex start capturing from the second word.

If you combine the results from the two regexes, you will have all the pairs from the text.

Regex regexObj1 = new Regex(
    @"(     # Match and capture in backreference no. 1:
     \w+    # one or more alphanumeric characters
     \s+    # one or more whitespace characters.
    )       # End of capturing group 1.
    (?=     # Assert that there follows...
     (\w+)  # another word; capture that into backref 2.
    )       # End of lookahead.", 
    RegexOptions.IgnorePatternWhitespace);
Match matchResult = regexObj1.Match(subjectString);
while (matchResult.Success) {
    resultList.Add(matchResult.Groups[1].Value + matchResult.Groups[2].Value);
    matchResult = matchResult.NextMatch();
}

Regex regexObj2 = new Regex(
    @"(?<=  # Assert (without consuming) that this is preceded by
     \w+\s+ # a word followed by whitespace.
    )
    (     # Match and capture in backreference no. 1:
     \w+    # one or more alphanumeric characters
     \s+    # one or more whitespace characters.
    )       # End of capturing group 1.
    (?=     # Assert that there follows...
     (\w+)  # another word; capture that into backref 2.
    )       # End of lookahead.", 
    RegexOptions.IgnorePatternWhitespace);
Match matchResult1 = regexObj1.Match(subjectString);
Match matchResult2 = regexObj2.Match(subjectString);
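To see what each of the two patterns contributes, the following sketch runs both against the same input and merges the results. It relies on .NET accepting variable-length lookbehind; the names `TwoPatternDemo`, `Collect`, and `Combine` are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TwoPatternDemo
{
    // Original pattern: pairs starting at the first word.
    public static readonly Regex FromFirstWord =
        new Regex(@"(\w+\s+)(?=(\w+))");

    // Same pattern with a lookbehind: pairs starting at the second word.
    public static readonly Regex FromSecondWord =
        new Regex(@"(?<=\w+\s+)(\w+\s+)(?=(\w+))");

    // Hypothetical helper: concatenates groups 1 and 2 of every match.
    public static List<string> Collect(Regex regex, string subjectString)
    {
        return regex.Matches(subjectString)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value + m.Groups[2].Value)
            .ToList();
    }

    // Union keeps first-seen order and drops pairs found by both patterns.
    public static List<string> Combine(string subjectString)
    {
        return Collect(FromFirstWord, subjectString)
            .Union(Collect(FromSecondWord, subjectString))
            .ToList();
    }

    public static void Main()
    {
        const string text = "I want this split up into pairs";
        Console.WriteLine(Collect(FromFirstWord, text).Count);   // 6
        Console.WriteLine(Collect(FromSecondWord, text).Count);  // 5
        Console.WriteLine(Combine(text).Count);                  // 6
    }
}
```

With these lookahead-based patterns the two result sets overlap, so they are merged with `Union` rather than simply appended.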

etc.

For groups of three:

You will need to add a third regex to the program:

Regex regexObj3 = new Regex(
        @"(?<=  # Assert (without consuming) that this is preceded by
         \w+\s+\w+\s+ # two words, each followed by whitespace.
        )
        (     # Match and capture in backreference no. 1:
         \w+    # one or more alphanumeric characters
         \s+    # one or more whitespace characters.
        )       # End of capturing group 1.
        (?=     # Assert that there follows...
         (\w+)  # another word; capture that into backref 2.
        )       # End of lookahead.", 
        RegexOptions.IgnorePatternWhitespace);
    Match matchResult1 = regexObj1.Match(subjectString);
    Match matchResult2 = regexObj2.Match(subjectString);
    Match matchResult3 = regexObj3.Match(subjectString);
