
Use RegEx to parse a string with complicated delimiting

This is a RegEx question.

Thanks for any help, and please be patient: RegEx is definitely not my strength!

As background: I want to use RegEx to parse strings similar to SVG path data segments. I've looked for previous answers that parse both the segments and their segment-attributes, but found none that handle the attributes properly.

Here are some example strings like the ones I need to parse:

M-11.11,-22
L.33-44  
ac55         66 
h77  
M88 .99  
Z 

I need to have the strings parsed into arrays like this:

["M", -11.11, -22]
["L", .33, -44]
["ac", 55, 66]
["h", 77]
["M", 88, .99]
["Z"]

So far I have found this code in an answer to the question "Parsing SVG "path" elements with C# - are there libraries out there to do this?". That post is about C#, but the regex was also useful in JavaScript:

var argsRX = /[\s,]|(?=-)/; 
var args = segment.split(argsRX);

Here's what I get:

 [ "M", -11.11, -22, <empty element>  ]
 [ "L.33", -44, <empty>, <empty> ]
 [ "ac55", <empty>, <empty>, <empty>, 66, <empty> ]
 [ "h77", <empty>, <empty> ]
 [ "M88", .99, <empty>, <empty> ]
 [ "Z", <empty> ]

Problems when using this regex:

  • An unwanted empty array element is being put at the end of each string's array.
  • If multiple spaces are delimiters, an unwanted empty array element is being created for each extra space.
  • If a number immediately follows the opening letters, that number is being attached to the letters, but should become a separate array element.

Here are more complete definitions of incoming strings:

  • Each string starts with 1 or more letters (mixed case).
  • Next are zero or more numbers.
  • The numbers might have minus signs (always preceding).
  • The numbers might have a decimal point anywhere in the number (except the end).
  • Possible delimiters are: comma, space, spaces, the minus sign.
  • A comma with space(s) in front or back is also a possible delimiter.
  • Even though minus signs are delimiters, they must also remain with their number.
  • A number might immediately follow the opening letters (no space) and that number should be separate.

Here is test code I've been using:

<!doctype html>
<html>
<head>
<link rel="stylesheet" type="text/css" media="all" href="css/reset.css" /> <!-- reset css -->
<script type="text/javascript" src="http://code.jquery.com/jquery.min.js"></script>

<style>
    body{ background-color: ivory; }
</style>

<script>
    $(function(){


var pathData = "M-11.11,-22 L.33-44  ac55    66 h77  M88 .99  Z";

// separate pathData into segments
var segmentRX = /[a-z]+[^a-z]*/ig;
var segments = pathData.match(segmentRX);

for(var i=0;i<segments.length;i++){
    var segment=segments[i];
    //console.log(segment);

    var argsRX = /[\s,]|(?=-)/; 
    var args = segment.split(argsRX);
    for(var j=0;j<args.length;j++){
        var arg=args[j];
        console.log(arg.length+": "+arg);
    }

}

    }); // end $(function(){});
</script>

</head>

<body>
</body>
</html>
^([a-z]+)(?:(-?\d*\.?\d+)[^\d\n\r.-]*(-?\d*\.?\d+)?)?

Explanation

^                # start of string
([a-z]+)         # one or more letters, captured in group 1
(?:              # non-capturing group
  (-?\d*\.?\d+)  #   first number (optional sign and decimal point, digits)
  [^\d\n\r.-]*   #   delimiting characters (anything but these)
  (-?\d*\.?\d+)? #   second number (optional)
)?               # end non-capturing group, make optional

Use with "case insensitive" flag.
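
A minimal usage sketch, assuming the pattern is applied to one segment at a time and that the decimal point is written as `\.` so it matches only a literal dot. The helper name `parseSegment` is hypothetical, and the `parseFloat` conversion is an addition that the pattern itself does not perform.

```javascript
// Assumption: the decimal point is escaped (\.) so it only matches a literal dot.
var segmentRX = /^([a-z]+)(?:(-?\d*\.?\d+)[^\d\n\r.-]*(-?\d*\.?\d+)?)?/i;

// Hypothetical helper: applies the pattern to a single segment.
function parseSegment(segment) {
    var m = segmentRX.exec(segment);
    if (!m) return null;               // segment does not start with letters
    var result = [m[1]];               // group 1: the opening letters
    // groups 2 and 3 are undefined when the corresponding number is absent
    if (m[2] !== undefined) result.push(parseFloat(m[2]));
    if (m[3] !== undefined) result.push(parseFloat(m[3]));
    return result;
}

var examples = ["M-11.11,-22", "L.33-44  ", "ac55         66 ", "h77  ", "M88 .99  ", "Z "];
var parsed = examples.map(parseSegment);
console.log(parsed);
```

Because the pattern has only two number groups, this sketch reads at most two numbers per segment.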

Your "pattern" consists of one or more letters, followed by a decimal number, followed by a second number that is separated from the first by commas or whitespace, or simply by its own minus sign.

Regex: /([a-z]+)(-?(?:\d*\.)?\d+)(?:[,\s]+|(?=-))(-?(?:\d*\.)?\d+)/i

I had to perform very similar parsing of data for reporting live results at the nation's largest track meet. http://ksathletics.com/2013/statetf/liveresults.js Although there was a lot of both client and server-side code involved, the principles are the same. In fact, the kind of data was practically identical.

I suggest that you do not use one "jumbo" regular expression, but rather one expression which separates data pieces and another which breaks each data piece into its main identifier and the following values. This solves the problem of various delimiters by allowing the second-level regular expression to match the definition of data values rather than having to distinguish delimiters. (This also is more efficient than putting all of the logic into a single regular expression.)

This is a solution tested to work on the input you gave.

<script>
var pathData = "M-11.11,-22 L.33-44  ac55    66 h77  M88 .99  Z";

function parseData(pathData) {
    // first pass: split the data into pieces, each starting with its letters
    var pieces = pathData.match(/([a-z]+[-.,\d ]*)/gi), i;
    // second pass: break each piece into its letters and its numbers
    for (i = 0; i < pieces.length; i++) {
        pieces[i] = pieces[i].match(/([a-z]+|-?[.\d]*\d)/gi);
    }
    return pieces;
}

var pathPieces = parseData(pathData);
document.write(pathPieces.join('<br />'));
console.log(pathPieces);
</script>

http://dropoff.us/private/1370846040-1-test-path-data.html

Update: The results are exactly equivalent to the specified output you want. One thought that came to mind, however, was whether you also want or need type conversion from strings to numbers. Do you need that as well? I'm just thinking of the next step beyond parsing the data.
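
If the conversion to numbers is wanted, one possible follow-up step is sketched below. It assumes the nested-array shape that `parseData` returns (letters first, numeric strings after); `toNumbers` is a hypothetical helper name, not part of the solution above.

```javascript
// Hypothetical helper: converts every value after the opening letters to a number.
// Assumes each piece looks like ["M", "-11.11", "-22"].
function toNumbers(pieces) {
    return pieces.map(function (piece) {
        return piece.map(function (value, index) {
            // index 0 holds the letters; everything after it is numeric text
            return index === 0 ? value : parseFloat(value);
        });
    });
}

var sample = [["M", "-11.11", "-22"], ["L", ".33", "-44"], ["Z"]];
var converted = toNumbers(sample);
console.log(converted);
```

`parseFloat` accepts a leading minus sign and a leading decimal point, so strings such as "-22" and ".33" convert without extra handling.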

function parsePathData(pathData)
{
    var tokenizer = /([a-z]+)|([+-]?(?:\d+\.?\d*|\.\d+))/gi,
        match,
        current,
        commands = [];

    tokenizer.lastIndex = 0;
    while (match = tokenizer.exec(pathData))
    {
        if (match[1])
        {
            if (current) commands.push(current);
            current = [ match[1] ];
        }
        else
        {
            if (!current) current = [];
            current.push(match[2]);
        }
    }
    if (current) commands.push(current);
    return commands;
}

var pathData = "M-11.11,-22 L.33-44  ac55    66 h77  M88 .99  Z";
var commands = parsePathData(pathData);
console.log(commands);

Output:

[ [ "M", "-11.11", "-22" ],
  [ "L", ".33", "-44" ],
  [ "ac", "55", "66" ],
  [ "h", "77" ],
  [ "M", "88", ".99" ],
  [ "Z" ] ]

You can try with this pattern:

/([a-z]+)(-?(?:\d*\.)?\d+)?(?:\s+|,|(-(?:\d*\.)?\d+))?(-?(?:\d*\.)?\d+)?/i

(a bit long, but it seems to work)

Note that the last number can end up in capture group 3 or in capture group 4, depending on whether its minus sign acted as the delimiter.
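
A short sketch of how the groups might be read back, assuming the pattern is used with the `i` flag so that upper-case letters match. The helper name `readSegment` is hypothetical.

```javascript
// Assumption: the "i" flag is set so upper-case letters match [a-z].
var rx = /([a-z]+)(-?(?:\d*\.)?\d+)?(?:\s+|,|(-(?:\d*\.)?\d+))?(-?(?:\d*\.)?\d+)?/i;

// Hypothetical helper: collects whichever capture groups actually matched.
function readSegment(segment) {
    var m = rx.exec(segment);
    if (!m) return null;
    var out = [m[1]];                          // group 1: the opening letters
    if (m[2] !== undefined) out.push(m[2]);    // group 2: first number, if any
    // the second number is in group 3 when its minus sign was the delimiter,
    // otherwise in group 4; this sketch checks group 3 first
    var second = m[3] !== undefined ? m[3] : m[4];
    if (second !== undefined) out.push(second);
    return out;
}

var inputs = ["M-11.11,-22", "L.33-44", "ac55         66", "h77", "M88 .99", "Z"];
var results = inputs.map(readSegment);
console.log(results);
```

The values stay strings here; the sketch only shows which group each number comes from.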
