
How do I make my regex match whitespaces without consuming them?

I'm trying to match lines that contain chords, but I need to make sure each match is either surrounded by whitespace or at the start of the line. The whitespace must not be consumed, because I don't want it returned to the caller.

For example:

Standard Tuning (Capo on fifth fret)

Time signature: 12/8
Tempo: 1.5 * Quarter note = 68 BPM

Intro: G Em7 G Em7

  G                 Em7
I heard there was a secret chord
     G                   Em7
That David played and it pleased the lord
    C                D              G/B     D
But you don't really care for music, do you? 
        G/B                C          D
Well it goes like this the fourth, the fifth
    Em7                 C
The minor fall and the major lift
    D            B7/D#         Em
The baffled king composing hallelujah

Chorus:

G/A   G/B  C           Em         C             G/B   D/A    G
Hal - le-  lujah, hallelujah, hallelujah, hallelu-u-u-u-jah .... 

My current regex almost works, except that it also matches the "B" in "68 BPM". How do I make sure that chords are matched correctly? I don't want it to match the B in "Before" or the D or E in "SUBSIDE".

This is my algorithm for matching on each separate line:

function getChordMatches(line) {
    var pattern = /[ABCDEFG](?:#|##|b|bb)?(?:min|m)?(?:maj|add|sus|aug|dim)?[0-9]*(?:\/[ABCDEFG](?:#|##|b|bb)?)?/g;
    var chords = line.match(pattern);
    var positions = [];
    var match; // declared locally so it does not leak into the global scope
    while ((match = pattern.exec(line)) !== null) {
        positions.push(match.index);
    }

    return {
        "chords":chords,
        "positions":positions
    };
}

That is, I want arrays of the form ["A", "Bm", "C#"] and not [" A", "Bm ", " C# "].

Edit

I made it work using the accepted answer. I had to make some adjustments to accommodate the leading whitespace. Thanks for taking the time, everyone!

function getChordMatches(line) {
    var pattern = /(?:^|\s)[A-G](?:##?|bb?)?(?:min|m)?(?:maj|add|sus|aug|dim)?[0-9]*(?:\/[A-G](?:##?|bb?)?)?(?!\S)/g;
    var chords = line.match(pattern);
    var chordLength = -1;
    var positions = [];
    var match; // declared locally so it does not leak into the global scope

    while ((match = pattern.exec(line)) !== null) {
        positions.push(match.index);
    }

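    // Trim the leading whitespace captured by (?:^|\s) and shift each
    // position forward by the number of characters that were removed.
    for (var i = 0; chords && i < chords.length; i++) {
        chordLength = chords[i].length;
        chords[i] = chords[i].trim();
        positions[i] += chordLength - chords[i].length;
    }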

    return {
        "chords":chords,
        "positions":positions
    };
}

I assume that you have already split the input into lines and that the function processes them one at a time.

You just need to check that the line has a chord as the first item before extracting them:

if (/^\s*[A-G](?:##?|bb?)?(?:min|m)?(?:maj|add|sus|aug|dim)?[0-9]*(?:\/[A-G](?:##?|bb?)?)?(?!\S)/.test(line)) {
    // Match the chords here
}

I added ^\s* at the front to anchor the check to the beginning of the line, and added (?!\S) to check that the first chord is followed by a whitespace character (\s) or by the end of the line.
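As a rough sketch of how that check could be used to pick the chord lines out of a song, here it is wrapped in a helper. The names CHORD_LINE and isChordLine are made up for this example, and the sample lines come from the question.

```javascript
// Hypothetical helper: true when the first token on the line looks like a chord.
var CHORD_LINE = /^\s*[A-G](?:##?|bb?)?(?:min|m)?(?:maj|add|sus|aug|dim)?[0-9]*(?:\/[A-G](?:##?|bb?)?)?(?!\S)/;

function isChordLine(line) {
    return CHORD_LINE.test(line);
}

var sample = [
    "  G                 Em7",
    "I heard there was a secret chord",
    "Tempo: 1.5 * Quarter note = 68 BPM",
    "G/A   G/B  C           Em         C             G/B   D/A    G"
];

// Keeps only the first and the last sample line.
var chordLines = sample.filter(isChordLine);
```

One caveat with this check: a lyric line that starts with a standalone capital letter between A and G, such as "A long time ago", also passes, because "A" followed by a space looks exactly like a chord.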

Note that I made some minor changes to your regex, since A## (assuming it is a valid chord) will not be matched by your current regex. The regex engine tries the alternatives in the order they are written, so in #|## it attempts # first. It finds that A# matches and, because everything after it in your pattern is optional, returns that match without ever trying ## . Either reversing the order ( ##|# ) or using a greedy quantifier ( ##? ) fixes the problem, because the longer alternative is then tried first.
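The effect of the alternation order is easy to see on a stripped-down pattern. This is only an illustration; the variable names are made up and the patterns keep just the note letter and the sharps.

```javascript
// Shorter alternative first: the engine settles for a single "#".
var shortFirst = /[A-G](?:#|##)?/;
// Greedy quantifier: the second "#" is taken when it is present.
var greedy = /[A-G](?:##?)?/;
// Shorter alternative first, but with a trailing check that forces backtracking.
var shortFirstChecked = /[A-G](?:#|##)?(?!\S)/;

var a = "A##".match(shortFirst)[0];        // "A#"
var b = "A##".match(greedy)[0];            // "A##"
var c = "A##".match(shortFirstChecked)[0]; // "A##"
```

The third case shows that once something after the accidental can fail, such as (?!\S), the engine backtracks and tries ## anyway. Writing ##? still states the intent more directly and does not rely on that backtracking.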


If you are sure that: "if the first item is a chord, then the rest are chords", then instead of matching, you can just split by spaces:

line.split(/\s+/);
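One thing to watch for: chord lines usually start with spaces, and splitting such a line yields an empty string as the first element. A small sketch that trims first (splitChordLine is a made-up name):

```javascript
// Hypothetical helper: split a chord line into tokens, ignoring
// leading and trailing whitespace.
function splitChordLine(line) {
    var trimmed = line.trim();
    // An empty or all-whitespace line has no tokens at all.
    return trimmed === "" ? [] : trimmed.split(/\s+/);
}

var raw = "  G   Em7".split(/\s+/);     // ["", "G", "Em7"]
var clean = splitChordLine("  G   Em7"); // ["G", "Em7"]
```

Note that this throws away the column positions, so it only helps if you need the chord names and not where they sit above the lyrics.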

Update

If you want to just match your pattern, regardless of whether the chord is inside a sentence (what you currently have will do that):

/(?:^|\s)[A-G](?:##?|bb?)?(?:min|m)?(?:maj|add|sus|aug|dim)?[0-9]*(?:\/[A-G](?:##?|bb?)?)?(?!\S)/

This regex is to be placed in the code you have in your question.

(?:^|\s) checks that the chord is preceded by a whitespace character or is at the beginning of the line. Because that whitespace is part of the match, you need to trim the leading space from the result.

Using \b instead of (?:^|\s) avoids the leading-space issue, but the meaning is different. Unless you know the input well enough, I'd advise against it.
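To make the difference concrete, here is a small comparison of the two styles on a line from the question and on an ordinary sentence. The variable names are made up, and the patterns are built with new RegExp only so the shared chord part is written once.

```javascript
// The chord body shared by both patterns (same as the regex above).
var CHORD = "[A-G](?:##?|bb?)?(?:min|m)?(?:maj|add|sus|aug|dim)?[0-9]*(?:/[A-G](?:##?|bb?)?)?";
var spaced = new RegExp("(?:^|\\s)" + CHORD + "(?!\\S)", "g");
var bounded = new RegExp("\\b" + CHORD + "\\b", "g");

function trimAll(list) {
    return list.map(function (s) { return s.trim(); });
}

var line = "    D            B7/D#         Em";
var spacedChords = trimAll(line.match(spaced));   // ["D", "B7/D#", "Em"]
var boundedChords = trimAll(line.match(bounded)); // ["D", "B7/D", "Em"]

var sentence = "I like the key of C.";
var spacedInSentence = sentence.match(spaced);    // null
var boundedInSentence = sentence.match(bounded);  // ["C"]
```

With \b the trailing # is dropped, because # and the space after it are both non-word characters, so there is no word boundary between them and the engine backs off to the boundary after D. \b also accepts a chord-like letter that is followed by punctuation, which the whitespace check rejects.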


Another way is to split the string by \s+ and test the following regex against each token (note the ^ at the beginning and the $ at the end):

 /^[A-G](?:##?|bb?)?(?:min|m)?(?:maj|add|sus|aug|dim)?[0-9]*(?:\/[A-G](?:##?|bb?)?)?$/
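A sketch of the token approach follows. It is a slight variation, not exactly the split described above: instead of split it walks the tokens with /\S+/g and exec, which is an assumption made here so that the column of each chord is kept. getChordTokens and TOKEN_IS_CHORD are made-up names.

```javascript
// Anchored pattern: the whole token must be a chord.
var TOKEN_IS_CHORD = /^[A-G](?:##?|bb?)?(?:min|m)?(?:maj|add|sus|aug|dim)?[0-9]*(?:\/[A-G](?:##?|bb?)?)?$/;

function getChordTokens(line) {
    var tokenPattern = /\S+/g; // one match per run of non-whitespace
    var chords = [];
    var positions = [];
    var token;

    while ((token = tokenPattern.exec(line)) !== null) {
        if (TOKEN_IS_CHORD.test(token[0])) {
            chords.push(token[0]);
            positions.push(token.index); // column where the token starts
        }
    }

    return { chords: chords, positions: positions };
}

var fromChordLine = getChordTokens("  G   Em7");
// { chords: ["G", "Em7"], positions: [2, 6] }
var fromTempoLine = getChordTokens("Tempo: 1.5 * Quarter note = 68 BPM");
// { chords: [], positions: [] }
```

Since every token is already delimited by whitespace, there is nothing to trim and "BPM" is rejected as a whole token.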

Adding \b (word boundary) to the start and end works for me. Also, you can use [A-G] instead of [ABCDEFG] . Thus:

> re = /\b[A-G](?:#|##|b|bb)?(?:min|m)?(?:maj|add|sus|aug|dim)?[0-9]*(?:\/[A-G](?:#|##|b|bb)?)?\b/g
/\b[A-G](?:#|##|b|bb)?(?:min|m)?(?:maj|add|sus|aug|dim)?[0-9]*(?:\/[A-G](?:#|##|b|bb)?)?\b/g

> 'G/A   G/B  C           Em         C             G/B   D/A    G'.match(re)
["G/A", "G/B", "C", "Em", "C", "G/B", "D/A", "G"]

> 'Tempo: 1.5 * Quarter note = 68 BPM'.match(re)
null

In answer to the specific question in the title, use a lookahead:

 (?=\s)

When embedded in a regular expression, this ensures that the next character is whitespace without consuming it.
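A quick illustration of the difference between consuming the space and only looking at it (the variable names are made up for this example):

```javascript
// \s consumes the space, so it ends up in the match.
var consumed = /Em7\s/.exec("Em7 G")[0];      // "Em7 "
// (?=\s) only checks for the space; the match stops before it.
var lookedAhead = /Em7(?=\s)/.exec("Em7 G")[0]; // "Em7"

// (?=\s) needs an actual whitespace character, so it fails at the end of the line.
var lastChordPlain = /G(?=\s)/.test("Em7 G");     // false
// Allowing the end of input as an alternative covers the last chord too.
var lastChordWithEnd = /G(?=\s|$)/.test("Em7 G"); // true
```

The last two lines are the reason a chord pattern usually wants (?=\s|$) or the equivalent (?!\S) rather than a bare (?=\s).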

Try the following

function getChordMatches( line ) {
    var match,
        pattern = /(?:^|\s)([A-G](?:##?|bb?)?(?:min|m)?(?:maj|add|sus|aug|dim)?\d*(?:\/[A-G](?:##?|bb?)?)?)(?=$|\s)/g,
        chords = [],
        positions = [];

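    while ( (match = pattern.exec(line)) !== null ) {
        chords.push( match[1] );
        // match.index points at the leading whitespace (if any),
        // so add its length to get the position of the chord itself.
        positions.push( match.index + match[0].length - match[1].length );
    }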

    return {
        "chords" : chords,
        "positions" : positions
    };
}

It uses (?:^|\s) to make sure the chord is either at the start of the line or preceded by a whitespace character, and uses the positive lookahead (?=$|\s) to make sure the chord is followed by a whitespace character or is at the end of the line. Parentheses are added to capture the chord itself, which is then accessed through match[1] .
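If your JavaScript engine supports lookbehind assertions (added in ES2018, so not available in older browsers), the leading check can also be made non-consuming. The following is only a sketch under that assumption; getChordMatchesLookbehind is a made-up name. (?<!\S) means "not preceded by a non-whitespace character", which covers both the start of the line and a preceding space.

```javascript
function getChordMatchesLookbehind(line) {
    // Both ends are zero-width checks, so match[0] is the bare chord
    // and match.index is the column where the chord starts.
    var pattern = /(?<!\S)[A-G](?:##?|bb?)?(?:min|m)?(?:maj|add|sus|aug|dim)?\d*(?:\/[A-G](?:##?|bb?)?)?(?!\S)/g;
    var chords = [];
    var positions = [];
    var match;

    while ((match = pattern.exec(line)) !== null) {
        chords.push(match[0]);
        positions.push(match.index);
    }

    return { chords: chords, positions: positions };
}

var result = getChordMatchesLookbehind("G/A   G/B  C");
// { chords: ["G/A", "G/B", "C"], positions: [0, 6, 11] }
var none = getChordMatchesLookbehind("Tempo: 1.5 * Quarter note = 68 BPM");
// { chords: [], positions: [] }
```

With this form no capture group or trimming is needed, at the cost of requiring a newer engine.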
