
Regex to extract first 3 words from a string

I am trying to remove every word except the first 3 from a string (using TextPad).

Ex value: This is the string for testing.

I want to keep just the first 3 words, "This is the", from the string above and remove all the other words.

I figured out a regex that matches the first 3 words, (\w+\s+){3}, but I need to match everything except the first 3 words so that I can remove it. Can someone help me with it?

Exactly how depends on the flavor, but to eliminate everything except the first three words, you can use:

^((?:\S+\s+){2}\S+).*

which captures the first three words into capturing group 1 and also matches (without capturing) the rest of the line. For your replace string, you use a reference to capturing group 1. In C# it might look like:

resultString = Regex.Replace(subjectString, @"^((?:\S+\s+){2}\S+).*", "${1}", RegexOptions.Multiline);
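
For comparison, here is a minimal Python sketch of the same replacement, assuming the standard re module. The name keep_first_three is made up for illustration and is not part of any library.

```python
import re

# Same pattern as above; re.MULTILINE makes ^ match at the start of every line.
FIRST_THREE = re.compile(r"^((?:\S+\s+){2}\S+).*", re.MULTILINE)

def keep_first_three(text: str) -> str:
    """Replace each matching line with its first three whitespace-separated words."""
    return FIRST_THREE.sub(r"\1", text)

print(keep_first_three("This is the string for testing."))  # This is the
```

Note that \s+ can also match a line break, so a line with fewer than three words may borrow words from the next line. If that matters, a character class such as [ \t]+ is a safer separator.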

EDIT: Added the start-of-line anchor to each regex, and added TextPad specific flags.

If you want to eliminate the first three words, and capture the rest,

^(?:\w+\s+){3}([^\n\r]+)$

?: makes the group that matches the first three words non-capturing, and the second group captures everything after them.
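
A quick way to check this pattern outside TextPad is a small Python sketch like the one below. It assumes the standard re module, and rest_after_three is a hypothetical helper name.

```python
import re

# The non-capturing group skips three words; group 1 holds the rest of the line.
REST = re.compile(r"^(?:\w+\s+){3}([^\n\r]+)$", re.MULTILINE)

def rest_after_three(line: str):
    """Return everything after the first three words, or None if there is no match."""
    m = REST.search(line)
    return m.group(1) if m else None

print(rest_after_three("This is the string for testing."))  # string for testing.
```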

Is this what you're looking for? I'm not totally clear on your question, or your goal.

As suggested, here's the opposite. Capture the first three words only, and discard the rest:

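^((?:\w+\s+){3})(?:[^\n\r]+)$

The ?: moves to the second grouping, and an extra pair of parentheses wraps the whole repetition. Writing (\w+\s+){3} instead would leave only the third word in group 1, because a repeated capturing group keeps just its last iteration.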

As far as replacing that captured group, what do you want it replaced with? To replace each word individually, you'd have to capture each word individually:

^(\w+)\s+(\w+)\s+(\w+)\s+(?:[^\n\r]+)$

And then, for instance, you could replace each with its first letter capitalized:

Replace with: \u$1 \u$2 \u$3

Result is This Is The

In TextPad, lowercase \u in the replacement means change only the next letter. Uppercase \U changes everything after it (until the next capitalization flag).
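
Python's re module has no \u case-conversion flag in replacement strings, so a rough equivalent passes a function as the replacement. This is only a sketch under that assumption, and capitalize_first_three is a name invented for the example.

```python
import re

# One capturing group per word, as in the pattern above.
WORDS = re.compile(r"^(\w+)\s+(\w+)\s+(\w+)\s+(?:[^\n\r]+)$", re.MULTILINE)

def capitalize_first_three(line: str) -> str:
    def repl(m):
        # Upper-case only the first letter of each captured word, like \u does.
        return " ".join(w[0].upper() + w[1:] for w in m.groups())
    return WORDS.sub(repl, line)

print(capitalize_first_three("this is the string for testing."))  # This Is The
```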

Try it:

http://fiddle.re/f3hgv

(Click [Java] or whichever language is most relevant. Note that \u is not supported by RegexPlanet.)

Coming from a duplicate question, I'll post a solution which works for "traditional" regex implementations which do not support the Perl extensions \s, \W, etc. Newcomers who are not even aware that there are different dialects (aka flavors) of regular expressions are advised to read e.g. Why are there so many different regular expression dialects?

If you have POSIX class support, you can use [[:alnum:]_] for \w, [^[:alnum:]_] for \W, [[:space:]] for \s, etc. But if we suppose that whitespace will always be a space and you want to extract the first three tokens between spaces, you don't really need even that.

[^ ]+[ ]+[^ ]+[ ]+[^ ]+

matches three tokens separated by runs of spaces. (I put the spaces in brackets to make them stand out, and to make the set easy to extend if you want to include characters other than a single regular ASCII space in the token separator set. For example, if your regex dialect accepts \t for tab, or you are able to paste a literal tab in its place, you could extend this to

[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+

In most shells, you can type a literal tab with Ctrl+V followed by Tab, i.e. prefix it with an escape code, which is often typed by holding down the Ctrl key and pressing V.)
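
To try the bracket-only pattern without a shell, a short Python sketch works as well. Python's re is not a traditional dialect, but it accepts this syntax unchanged. The function name first_three_tokens is hypothetical.

```python
import re

# Only bracket expressions and +, no \s or \w shorthand. Python reads \t as a tab.
TOKENS = re.compile(r"[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+")

def first_three_tokens(line: str):
    """Return the leftmost run of three tokens, or None if the line has fewer."""
    m = TOKENS.search(line)
    return m.group(0) if m else None

print(first_three_tokens("This is the string for testing."))  # This is the
```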

To actually use this, you might want to do

grep -Eo '[^ ]+[ ]+[^ ]+[ ]+[^ ]+' file

where the single quotes are necessary to protect the regex from the shell (double quotes would work here too, but are weaker; another option is backslashing every character in the regex which the shell treats as a metacharacter), or perhaps

sed -r 's/([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/' file

to replace every line with just the captured expression (the parentheses make a capturing group, which you can refer back to with \1 in the replacement part of the s command in sed). The -r option selects a slightly more featureful regex dialect than the bare-bones traditional sed; if your sed doesn't have it, try -E, or put a backslash before each parenthesis and plus sign.

Because of the way regular expressions work, extracting the first three tokens is easy: a regular expression engine will always return the leftmost possible match on a line. If you want three tokens starting from the second, you have to put in a skip expression. Adapting the sed script above, that would be

sed -r 's/[^ ]+[ ]+([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/'

where you'll notice how I put in a token+non-token group before the capture. (This is not really possible with grep -o unless you have grep -P in which case the full gamut of Perl extensions is available to you anyway.)
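
The same skip-then-capture idea can be sketched in Python, assuming the standard re module. The helper name tokens_two_to_four is invented for this example.

```python
import re

# One token plus separator is skipped, then three tokens are captured as group 1.
SKIP_ONE = re.compile(r"[^ ]+[ ]+([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*")

def tokens_two_to_four(line: str) -> str:
    # count=1 mirrors sed's s command without the g flag.
    return SKIP_ONE.sub(r"\1", line, count=1)

print(tokens_two_to_four("This is the string for testing."))  # is the string
```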

If your regex dialect supports {m,n} repetition, you can of course refactor the regex to use that. If you need a large number of repetitions, it's certainly both more readable and more maintainable. Just make sure you don't add parentheses where you break up the backreference order (the first left parenthesis creates the first group \1, the second \2, etc.)

sed -r 's/([^ ]+([ ]+[^ ]+){2}).*/\1/' file

Notice how the second parenthesized group is necessary to specify the scope of the {2} repetition (we want to repeat more than just the single character immediately before the left curly brace). The OP's attempt had an error where the repetition was specified outside of the last parenthesis; then, the back reference \1 (or whatever it's called in your dialect; TextMate seems to use $1, just like Perl) will refer only to the last single match of the capturing parentheses, because the repetition is outside the capturing parentheses and so is not part of the capture.
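
The difference is easy to see in a short experiment. The sketch below uses Python's re module purely as a convenient test bed; other engines may expose repeated groups differently.

```python
import re

line = "This is the string for testing."

# Quantifier outside the capturing group: group 1 keeps only the last repetition.
last_only = re.match(r"(\w+\s+){3}", line)
print(repr(last_only.group(1)))  # 'the '

# Quantifier inside an outer capturing group: group 1 spans all three words.
all_three = re.match(r"((?:\w+\s+){3})", line)
print(repr(all_three.group(1)))  # 'This is the '
```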

