简体   繁体   中英

Splitting string by a colon that is NOT contained within parenthesis ()

I'm working on an HTML5/JavaScript game engine, and I have started to encounter a scenario I haven't ever been in the past, and can't figure out how I can pull this off.

Simply put, I want to split a string into an array, by a character - so long as that character is not within parenthesis.

Basically, in XML files for things like items/tiles, I store "triggers", which are statements giving rules for operations the code will perform. The different parameters of a single trigger are split up with a colon (:), and more than one trigger can be in place for an item, whereby each trigger is split by a comma. Here's an example:

<response trigger="npc:self:dialog:1:3">No, thank you.</response>

(This is basically saying: if this response is selected, make the NPC who asked the initial question cycle to the a specific message of a specific conversion)

Moving along: I have come to need the ability to encapsulate callback triggers within parenthesis of parameters with certain triggers. Here's an example:

<response trigger="shop:open:1:(npc:self:dialog:1:4)">Yes, please.</response>

(This is basically saying: open up a specific store, and when the store is closed, jump to a specific conversation/message for the speaking NPC)

The idea is that when a store is closed, I can invoke the 4th parameter of that trigger (which is a trigger itself). As I am sure you have guessed, the problem here is that if I split the initial trigger-string based on ":", then it breaks up the callback trigger as other (messy) parameters of the main trigger. I don't want that. Nor, do I want to do anything like splitting secondary triggers by another character (for generation reasons later on, and because I imagine there will be times where I will want to nest lots of triggers at deeper levels and I don't want to use different characters. I know of work-arounds, but I'd like to learn the proper way to split by a character that is not contained within other specific characters.

Since I am encapsulating the callback parameter with parenthesis, I figure there must be a clean regular expression I can use to split the main trigger string by all colons NOT INSIDE parenthesis.

Sadly, I haven't been able to come up with the right expression to get this done.

Any ideas?

I greatly appreciate any assistance any of you may have. :)

I suspect you can't, at least if there's any chance of nested parentheses, since recognizing correct parenthesis-nesting is not regular.

In any case, instead of constructing some baroque regular expression, consider a very simple parser: scan to the next occurrence of either ":" or "(", and do something with the next token. Repeat. It would be easy to do with with recursive descent, and would look something like

parse(string)
   if string is empty: return
   scan to delimiter, put delimiter index into d, token string into t
   put t into a table for processing later
   case on d:
      string[d] == ":": parseColonToken(string[d+1:])
      string[d] == "(": parseParentString(strin[d+1:])
   end
end

(Obviously this is pseudocode. Take string[n:] as "the substring of string from index n to the end.)

probably, thinking about it, you'd simply start with parseColonToken but I'm not sure if that matches your expected grammar.

There is a good reason why you were unable to find a regular expression for your problem:

The language you describe is not regular , ie it cannot be parsed with a regular expression .

Basically, you have to parse the parenthesis structure in order to determine the colons which are outside of all parentheses. This is not possible with a regular expression.

The language of nested parenthesis is context-free [1], so it straight-forward to write a recursive parser.

[1] http://en.wikipedia.org/wiki/Context-free_language

ADDITION: You don't need a recursive parser, a simple counter for the parenthesis nesting level is enough:

// Pseudo code
int depth = 0;
List<int> breakIndices;
for int index = 0 .. input.length-1:
  switch(input[index])
    ':': if (depth==0) breakIndices.add(index); break;
    '(': depth++; break;
    ')': depth--; break;
    default: break;
// Now, all indices of the colons you need are in the breakIndices list.

I think the easiest approach would be to break the string into the "function" part and the "argument" part and then deal with the two parts separately. If you want to keep the parentheses on the argument part, then:

var parts1 = "shop:open:1:(npc:self:dialog:1:4)".split(/:(?=\()/);
// parts1 now looks like ["shop:open:1", "(npc:self:dialog:1:4)"]
var parts2 = "shop:open:1".split(/:(?=\()/);
// parts2 now looks like ["shop:open:1"]

And then:

var cmd = null;
var arg = null;
if(parts.length > 0) {
    cmd = parts[0].split(':');
    arg = (parts[1] || '').replace(/[()]/g, '').split(':');
}

You might be able to cram more of that into a single regex (and possibly all of it depending on what non-regular features your target regex engine supports) but there's not much point and clarity is a better goal for your code than "short". Anyone looking at the above should be able to figure out what it is doing if they have a decent JavaScript regex reference in hand.

If you end up dealing with more complicated expressions with quoting and escaping and such, then you could try modifying a CSV parser to do what you need.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM