
What is the regex to match this string?

Consider these sentences:

apple is 2kg
apple banana mango is 2kg
apple apple apple is 6kg
banana banana banana is 6kg

Given that "apple", "banana", and "mango" are the only fruits, what regex would extract the fruit name(s) that appear at the start of each sentence?

I wrote this regex (https://regex101.com/r/fY8bK1/1):

^(apple|mango|banana) is (\d+)kg$  

but this only matches if a single fruit is in the sentence.
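A quick way to reproduce the problem, as a minimal sketch (the names `single`, `sentences`, and `singleResults` are made up for this illustration):

```javascript
// Original pattern from the question: exactly one fruit is allowed before " is".
const single = /^(apple|mango|banana) is (\d+)kg$/;

const sentences = [
  'apple is 2kg',
  'apple banana mango is 2kg',
  'apple apple apple is 6kg',
  'banana banana banana is 6kg',
];

// Only the first sentence passes, because the fruit must be followed directly by " is".
const singleResults = sentences.map((s) => single.test(s));
console.log(singleResults); // [ true, false, false, false ]
```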

How do I extract all the fruit names?

The expected output, for all 4 sentences, should be:

apple, 2
apple banana mango, 2
apple apple apple, 6
banana banana banana, 6

You can use grouping like this:

^((?:apple|mango|banana)(?:\s+(?:apple|mango|banana))*) is (\d+)kg$

See regex demo

The (?:...) groups are non-capturing. They sit inside the capturing (...) group so that the inner alternations do not add extra groups to the output.

The ((?:apple|mango|banana)(?:\s+(?:apple|mango|banana))*) group matches:

  • (?:apple|mango|banana) - any value from the alternative list, delimited with the alternation operator |. If you plan to match whole words only, put \b at both ends of the subpattern.
  • (?:\s+(?:apple|mango|banana))* - 0 or more sequences of:
    • \s+ - 1 or more whitespace characters
    • (?:apple|mango|banana) - any of the alternatives.

Snippet:

var re = /^((?:apple|mango|banana)(?:\s+(?:apple|mango|banana))*) is (\d+)kg$/gm;
var str = 'apple is 2kg\napple banana mango is 2kg\napple apple apple is 6kg\nbanana banana banana is 6kg';
var m;

// Print "fruits,weight" for every line that matches.
while ((m = re.exec(str)) !== null) {
  document.write(m[1] + "," + m[2] + "<br/>");
}

// Fruits must be separated by whitespace, so "appleapple" is rejected (prints false).
document.write("<b>appleapple is 2kg</b> matched: " + /^((?:apple|mango|banana)(?:\s+(?:apple|mango|banana))*) is (\d+)kg$/.test("appleapple is 2kg"));
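Since the alternation appears twice in the pattern, it may be easier to build it from a list. This is only a sketch: `buildFruitRegex` is a hypothetical helper, not part of any library, and it assumes the fruit names contain no regex metacharacters.

```javascript
// Hypothetical helper: builds the pattern from a list of fruit names.
// Assumes the names are plain words with no regex metacharacters.
function buildFruitRegex(fruits) {
  const alt = '(?:' + fruits.join('|') + ')';
  // In a string literal backslashes must be doubled: '\\s' becomes \s in the regex.
  return new RegExp('^(' + alt + '(?:\\s+' + alt + ')*) is (\\d+)kg$');
}

const fruitRe = buildFruitRegex(['apple', 'mango', 'banana']);
const m = fruitRe.exec('apple banana mango is 2kg');
console.log(m[1] + ', ' + m[2]); // apple banana mango, 2
console.log(fruitRe.test('appleapple is 2kg')); // false
```

Adding a fruit then means touching the array only, rather than editing the pattern in two places.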

Try this

var re = /^((?:(?:apple|banana|mango)(?= ) ?)+) is (\d+)kg$/gm;

re.exec('apple banana mango is 2kg');
// ["apple banana mango is 2kg", "apple banana mango", "2"]

What makes this different from the other answers? The (?= ) ? after the fruit alternatives requires a space as the next character, but that space only ends up in the capture when another fruit follows (or when there are two spaces before "is").


Use this in a while loop to get all the results from a multi-line string.
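For example, a loop along these lines should collect every line (a sketch only; `text`, `rows`, and `match` are names chosen for this illustration):

```javascript
const re = /^((?:(?:apple|banana|mango)(?= ) ?)+) is (\d+)kg$/gm;
const text = [
  'apple is 2kg',
  'apple banana mango is 2kg',
  'apple apple apple is 6kg',
  'banana banana banana is 6kg',
].join('\n');

const rows = [];
let match;
// With the g flag, each exec call resumes from re.lastIndex and returns null when done.
while ((match = re.exec(text)) !== null) {
  rows.push(match[1] + ', ' + match[2]);
}
console.log(rows.join('\n'));
// apple, 2
// apple banana mango, 2
// apple apple apple, 6
// banana banana banana, 6
```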


The gm flags let this RegExp be applied to the same string repeatedly with re.exec, with ^ and $ matching at line breaks. However, the g flag changes how str.match behaves: it returns only the full matches, without the capture groups.
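To see the difference, here is a small sketch (variable names are illustrative; `matchAll` is a standard ES2020 method, shown only as one possible alternative to the exec loop):

```javascript
const lines = 'apple is 2kg\nbanana banana banana is 6kg';
const withG = /^((?:(?:apple|banana|mango)(?= ) ?)+) is (\d+)kg$/gm;

// With g, match returns only the full matches; the capture groups are dropped.
const fullMatches = lines.match(withG);
console.log(fullMatches); // [ 'apple is 2kg', 'banana banana banana is 6kg' ]

// matchAll keeps the groups for every match.
const pairs = Array.from(lines.matchAll(withG), (m) => [m[1], m[2]]);
console.log(pairs); // [ [ 'apple', '2' ], [ 'banana banana banana', '6' ] ]
```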

If you want an independent test for each string, you can keep using re.exec, or remove these flags and use str.match instead:

var re = /^((?:(?:apple|banana|mango)(?= ) ?)+) is (\d+)kg$/; // notice flags gone

'apple banana mango is 2kg'.match(re);
// ["apple banana mango is 2kg", "apple banana mango", "2"]
As a substitution (pattern, replacement, flags):

/^(((apple|mango|banana)\s*)+) is (\d+)kg$/$1,$4/gm

DEMO: https://regex101.com/r/sA4aW7/2

You start from a single fruit, one of:

(apple|mango|banana)

Let's also match the optional whitespace that separates repetitions:

(apple|mango|banana)\s*

and then all of the repetitions (at least one):

((apple|mango|banana)\s*)+

You need one more group, because you want a single group capturing the whole list:

(((apple|mango|banana)\s*)+)

At this point, $1 (the outermost group) will contain "banana banana banana ..." and $4 the weight. Add ?: to the inner groups if you would rather not capture them.
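In JavaScript the same substitution could be done with String.prototype.replace. A sketch, with `input`, `output`, and the regex variable names invented for this example:

```javascript
const subRe = /^(((apple|mango|banana)\s*)+) is (\d+)kg$/gm;
const input = 'apple is 2kg\napple banana mango is 2kg\nbanana banana banana is 6kg';

// $1 is the outermost group (all fruits), $4 is the weight.
const output = input.replace(subRe, '$1,$4');
console.log(output);
// apple,2
// apple banana mango,2
// banana banana banana,6

// With the inner groups made non-capturing, the weight moves to $2.
const subRe2 = /^((?:(?:apple|mango|banana)\s*)+) is (\d+)kg$/gm;
const output2 = input.replace(subRe2, '$1,$2');
console.log(output2 === output); // true
```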

/^((?:apple|mango|banana| )+) is (\d+)kg\s?$/gmi

DEMO: https://regex101.com/r/dO1rR7/1


Explanation

/^((?:apple|mango|banana| )+) is (\d+)kg\s?$/gmi

^ assert position at start of a line
1st Capturing group ((?:apple|mango|banana| )+)
    (?:apple|mango|banana| )+ Non-capturing group
        Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
        1st Alternative: apple
            apple matches the characters apple literally (case insensitive)
        2nd Alternative: mango
            mango matches the characters mango literally (case insensitive)
        3rd Alternative: banana
            banana matches the characters banana literally (case insensitive)
        4th Alternative: a single space
            matches the space character literally
" is" matches the characters " is" (with the leading space) literally (case insensitive)
2nd Capturing group (\d+)
    \d+ match a digit [0-9]
        Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
kg matches the characters kg literally (case insensitive)
\s? match any white space character [\r\n\t\f ]
    Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
$ assert position at end of a line
g modifier: global. All matches (don't return on first match)
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
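A short sketch of how this pattern behaves in practice. The g flag is dropped here so that each call starts from the beginning of the string, and the variable names are made up for the example:

```javascript
// Same pattern as above, minus the g flag so each call starts from index 0.
const loose = /^((?:apple|mango|banana| )+) is (\d+)kg\s?$/im;

// The i flag lets mixed-case fruit names through.
const mixed = loose.exec('Apple BANANA is 3kg');
console.log(mixed[1] + ', ' + mixed[2]); // Apple BANANA, 3

// Because the space is just another alternative, no separator is required
// between fruits, so this also matches.
console.log(loose.test('appleapple is 2kg')); // true
```

If fruits glued together should be rejected, a pattern that requires whitespace between the names is the safer choice.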
