
How to parse a regular expression?

Disclaimer before this is auto-closed. This is NOT the same as this:

How do you access the matched groups in a JavaScript regular expression?

Let's say I have this regular expression:

const regex = /(\w+) count: (\d+)/

Is there a way I can extract the capture groups so that I have:

[ '\w+', '\d+' ]

As others pointed out, to do this properly you'd need a real parser, such as one built with Lex & Yacc. You can, however, use regex and some recursion magic to parse nested structures. See details at https://twiki.org/cgi-bin/view/Blog/BlogEntry201109x3
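For comparison, the "real parser" route does not have to mean Lex & Yacc. Below is a minimal sketch of a hand-written scanner that walks the pattern once and tracks open parentheses on a stack. The function name `extractGroups` is hypothetical, and the sketch makes simplifying assumptions: it treats every `(...)` as a group (so `(?:...)` and lookarounds are reported too, prefix included), and it only skips escaped characters and `[...]` character classes.

```javascript
// Hypothetical helper (not part of the original answer): returns the source
// text of every parenthesised group, ordered by its opening parenthesis.
function extractGroups(source) {
  const groups = [];   // result slots, one per '(' encountered
  const open = [];     // stack of { slot, start } for groups not yet closed
  let inClass = false; // true while inside a [...] character class

  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === '\\') {
      i++;                                  // skip the escaped character, e.g. \( or \\
    } else if (inClass) {
      if (ch === ']') inClass = false;      // parentheses inside [...] are literals
    } else if (ch === '[') {
      inClass = true;
    } else if (ch === '(') {
      groups.push(null);                    // reserve a slot to keep opening order
      open.push({ slot: groups.length - 1, start: i + 1 });
    } else if (ch === ')' && open.length > 0) {
      const { slot, start } = open.pop();
      groups[slot] = source.slice(start, i);
    }
  }
  return groups;                            // unclosed groups stay null
}

const regex = /(\w+) count: (\d+)/;
console.log(extractGroups(regex.source).join(' | ')); // prints: \w+ | \d+
```

`regex.source` gives the pattern text without the surrounding slashes, so the same function works on a `RegExp` object or on a plain string.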

Here is a JavaScript version that can parse nested groups properly. The default test is (\w+) count: (\d+), number: (-?\d+(\/\d+)?), i.e. three groups at level 0, and one group nested at level 1 inside the third group:

    // configuration:
    const ctrlChar = '~'; // use non-printable, such as '\x01'
    const cleanRegex = new RegExp(ctrlChar + '\\d+' + ctrlChar, 'g');

    function parseRegex(str) {

      // matches one group whose parentheses are annotated with the given level
      function _levelRegx(level) {
        return new RegExp('(' + ctrlChar + level + ctrlChar + ')\\((.*?)(' + ctrlChar + level + ctrlChar + ')\\)', 'g');
      }

      // replace callback: records the group body, then recurses into the next level
      function _extractGroup(m, p1, p2, p3) {
        //console.log('m: ' + m + ', p1: ' + p1 + ', p2: ' + p2 + ', p3: ' + p3);
        groups.push(p2.replace(cleanRegex, ''));
        let nextLevel = parseInt(p1.replace(/\D/g, ''), 10) + 1;
        p2 = p2.replace(_levelRegx(nextLevel), _extractGroup);
        return '(' + p2 + ')';
      }

      // annotate parentheses with proper nesting level:
      let level = 0;
      str = str.replace(/(?<!\\)[\(\)]/g, function(m) {
        if(m === '(') {
          return ctrlChar + (level++) + ctrlChar + m;
        } else {
          return ctrlChar + (--level) + ctrlChar + m;
        }
      });
      console.log('nesting: ' + str);

      // recursively extract groups:
      let groups = [];
      level = 0;
      str = str.replace(_levelRegx(level), _extractGroup);
      console.log('result: ' + str);
      console.log('groups: [ \'' + groups.join('\', \'') + '\' ]');
      $('#regexGroups').text(JSON.stringify(groups, null, ' '));
    }

    $(document).ready(function() {
      let str = $('#regexInput').val();
      parseRegex(str);
      $('#regexInput').on('input', function() {
        let str = $(this).val();
        parseRegex(str);
      });
    });
    div, input { font-family: monospace; }

    <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.0/jquery.min.js"></script>
    <div>
      <p>Regex: <input id="regexInput" value="(\w+) count: (\d+), number: (-?\d+(\/\d+)?)" size="60" /></p>
      <p>Groups: <span id="regexGroups"></span></p>
      <p>.<br />.<br />.</p>
    </div>

You can try it out with various nested patterns.

Explanation:

  • step 1: annotate opening and closing parentheses with their nesting level:
    • the annotation is done with the control character ~
    • in real life, use a non-printable character to avoid collisions
    • the result for (\w+) is ~0~(\w+~0~)
    • the result of the default input is ~0~(\w+~0~) count: ~0~(\d+~0~), number: ~0~(-?\d+~1~(\/\d+~1~)?~0~)
  • step 2: recursively extract groups:
    • we start with level 0, and extract all groups at that level
    • for each matched group, we recursively extract all groups at the next level
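The two steps can also be sketched without jQuery or the DOM, which makes the intermediate annotated string easier to inspect in Node. This is a restructured sketch, not the snippet from the answer: the names `annotateParens` and `collectGroups` are hypothetical, and it uses `matchAll` with explicit recursion in place of a `replace` callback. It keeps `~` as the marker so the output lines up with the examples above.

```javascript
const ctrlChar = '~'; // readable for the demo; a non-printable such as '\x01' is safer

// Step 1: prefix every parenthesis not preceded by a backslash with ~level~.
function annotateParens(str) {
  let level = 0;
  return str.replace(/(?<!\\)[()]/g, (m) =>
    m === '('
      ? ctrlChar + (level++) + ctrlChar + m   // opening: tag, then go one level deeper
      : ctrlChar + (--level) + ctrlChar + m); // closing: come back up, then tag
}

// Step 2: collect every group at `level`, then recurse into each group body.
function collectGroups(annotated, level = 0, groups = []) {
  const marker = ctrlChar + level + ctrlChar;
  const levelRegex = new RegExp(marker + '\\((.*?)' + marker + '\\)', 'g');
  const cleanRegex = new RegExp(ctrlChar + '\\d+' + ctrlChar, 'g');
  for (const m of annotated.matchAll(levelRegex)) {
    groups.push(m[1].replace(cleanRegex, '')); // strip markers of nested groups
    collectGroups(m[1], level + 1, groups);    // nested groups live inside m[1]
  }
  return groups;
}

const input = String.raw`(\w+) count: (\d+), number: (-?\d+(\/\d+)?)`;
const annotated = annotateParens(input);
console.log(annotated);
// prints: ~0~(\w+~0~) count: ~0~(\d+~0~), number: ~0~(-?\d+~1~(\/\d+~1~)?~0~)
console.log(collectGroups(annotated).join(' | '));
// prints: \w+ | \d+ | -?\d+(\/\d+)? | \/\d+
```

One caveat of the `(?<!\\)` lookbehind: it skips any parenthesis that directly follows a backslash, so a group opened right after an escaped backslash (`\\(`) would be missed, and parentheses inside character classes are counted as groups.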


 