
Need C# Regex to get pairs of words in a sentence

Is there a regex that would take the following sentence:

"I want this split up into pairs"

and generate the following list:

"I want", "want this", "this split", "split up", "up into", "into pairs"

Since words need to be re-used, you need lookahead assertions:

Regex regexObj = new Regex(
    @"(     # Match and capture in backreference no. 1:
     \w+    # one or more alphanumeric characters
     \s+    # one or more whitespace characters.
    )       # End of capturing group 1.
    (?=     # Assert that there follows...
     (\w+)  # another word; capture that into backref 2.
    )       # End of lookahead.", 
    RegexOptions.IgnorePatternWhitespace);
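var resultList = new List<string>();
Match matchResult = regexObj.Match(subjectString);
while (matchResult.Success) {
    // Group 1 ends with the whitespace separator, so plain concatenation yields "word1 word2".
    resultList.Add(matchResult.Groups[1].Value + matchResult.Groups[2].Value);
    matchResult = matchResult.NextMatch();
}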

For groups of three:

Regex regexObj = new Regex(
    @"(     # Match and capture in backreference no. 1:
     \w+    # one or more alphanumeric characters
     \s+    # one or more whitespace characters.
    )       # End of capturing group 1.
    (?=     # Assert that there follows...
     (      # and capture...
      \w+   # another word,
      \s+   # whitespace,
      \w+   # word.
     )      # End of capturing group 2.
    )       # End of lookahead.", 
    RegexOptions.IgnorePatternWhitespace);

etc.
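The pair and triple patterns above differ only in how many words the lookahead captures, so the idea can be sketched for any group size. This is a minimal sketch, not part of the original answer: the class name WordGroups, the method name Sliding and the parameter n are made up for illustration, and the pattern is written on one line instead of using RegexOptions.IgnorePatternWhitespace.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordGroups
{
    // Hypothetical helper: returns every run of n consecutive words in subject.
    public static List<string> Sliding(string subject, int n)
    {
        if (n < 2)
            throw new ArgumentOutOfRangeException(nameof(n), "n must be at least 2");

        // Group 1 consumes one word plus its trailing whitespace.
        // The lookahead captures the next n - 1 words into group 2 without
        // consuming them, so the following match can start at the second word.
        string pattern = @"(\w+\s+)(?=(\w+(?:\s+\w+){" + (n - 2) + "}))";

        var results = new List<string>();
        foreach (Match m in Regex.Matches(subject, pattern))
        {
            results.Add(m.Groups[1].Value + m.Groups[2].Value);
        }
        return results;
    }
}
```

Calling WordGroups.Sliding(subjectString, 2) should give the pairs from the question, and passing 3 gives the triples. Whitespace between words is kept exactly as it appears in the input, because the groups are concatenated unchanged.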

You could do

var myWords = myString.Split(' ');

var myPairs = myWords.Take(myWords.Length - 1)
    .Select((w, i) => w + " " + myWords[i + 1]);

You could just use string.Split() and combine the results:

var words = myString.Split(new char[] { ' ' });
var pairs = new List<string>();

for (int i = 0; i < words.Length - 1; i++)
{
    pairs.Add(words[i] + " " + words[i + 1]);
}
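Both Split-based approaches assume the words are separated by exactly one space; repeated spaces or tabs would produce empty "words". A hedged variant that tolerates any run of whitespace might look like the following. The names SplitPairs and FromSplit are hypothetical, and Zip is used here as an alternative to indexing, not as what either answer proposed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitPairs
{
    // Hypothetical helper: splits on any whitespace and pairs adjacent words.
    public static List<string> FromSplit(string text)
    {
        // A null separator array makes Split break on whitespace characters;
        // RemoveEmptyEntries drops the empty strings that repeated spaces would produce.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        // Pair each word with its successor. Zip stops at the shorter sequence,
        // so the last word is never paired with anything past the end.
        return words.Zip(words.Skip(1), (a, b) => a + " " + b).ToList();
    }
}
```

Unlike the regex version, this normalizes the separator to a single space, and it returns an empty list when the input has fewer than two words.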

To do it with regex only and without post-processing, we can reuse Tim Pietzcker's answer but run two regexes in succession.

Use the original pattern from Tim Pietzcker's answer, plus the same pattern preceded by a lookbehind that makes the regex start capturing from the second word.

Combining the results of the two regexes gives you all the pairs in the text.

Regex regexObj1 = new Regex(
    @"(     # Match and capture in backreference no. 1:
     \w+    # one or more alphanumeric characters
     \s+    # one or more whitespace characters.
    )       # End of capturing group 1.
    (?=     # Assert that there follows...
     (\w+)  # another word; capture that into backref 2.
    )       # End of lookahead.", 
    RegexOptions.IgnorePatternWhitespace);
Match matchResult = regexObj1.Match(subjectString);
while (matchResult.Success) {
    resultList.Add(matchResult.Groups[1].Value + matchResult.Groups[2].Value);
    matchResult = matchResult.NextMatch();
}

Regex regexObj2 = new Regex(
    @"(?<=  # Assert, without capturing, that this is preceded by
     \w+\s+ # the first word followed by whitespace.
    )
    (     # Match and capture in backreference no. 1:
     \w+    # one or more alphanumeric characters
     \s+    # one or more whitespace characters.
    )       # End of capturing group 1.
    (?=     # Assert that there follows...
     (\w+)  # another word; capture that into backref 2.
    )       # End of lookahead.", 
    RegexOptions.IgnorePatternWhitespace);
Match matchResult1 = regexObj1.Match(subjectString);
Match matchResult2 = regexObj2.Match(subjectString);

etc

For groups of three:

You will need to add a third RegEx to the program:

Regex regexObj3 = new Regex(
    @"(?<=  # Assert, without capturing, that this is preceded by
     \w+\s+\w+\s+ # the first and second words, each followed by whitespace.
    )
    (       # Match and capture in backreference no. 1:
     \w+    # one or more alphanumeric characters
     \s+    # one or more whitespace characters.
    )       # End of capturing group 1.
    (?=     # Assert that there follows...
     (\w+)  # another word; capture that into backref 2.
    )       # End of lookahead.", 
    RegexOptions.IgnorePatternWhitespace);
Match matchResult1 = regexObj1.Match(subjectString);
Match matchResult2 = regexObj2.Match(subjectString);
Match matchResult3 = regexObj3.Match(subjectString);
