
Dealing with grammar ambiguity (poker file parsing)

I am currently working on a poker hand history parser as part of my bachelor project. I've been doing some research over the past couple of days and came across a few nice parser generators (of which I chose JavaCC, since the project itself will be coded in Java).

Despite the hand history grammar being pretty basic and straightforward, there's an ambiguity problem due to the set of characters allowed in a player's nickname.

Suppose we have a line in the following format:

Seat 5: myNickname (1500 in chips)

The token myNickname can contain any character, including white space. This means that both "(1500 in chip" and "Seat 5:" are valid nicknames, which ultimately leads to an ambiguity problem. There are no restrictions on a player's nickname except for length (4-12 characters).

I need to parse and store several pieces of data along with the player's nickname (e.g. seat position and amount of chips in this particular case), so my question is: what are my options here?

I would love to do it using JavaCC, with something along these lines:

SeatRecord seat() :
{ Token seatPos, nickname, chipStack; }
{
    "Seat" seatPos=<INTEGER> ":" nickname=<NICKNAME> "(" chipStack=<INTEGER> 
    "in chips)"
    {
        return new SeatRecord(seatPos.image, nickname.image, chipStack.image); 
    }
}  

This doesn't work right now (due to the problem mentioned above).

I also searched around for GLR parsers (which apparently handle ambiguous grammars), but they mostly seem to be abandoned or poorly documented. The exception is Bison, but it doesn't support GLR parsers for Java and might be too complex to work with anyway (aside from the ambiguity problem, the grammar itself is pretty basic, as I mentioned).

Or should I stick to tokenizing the string myself and use indexOf(), lastIndexOf() etc. to parse the data I need? I would go for that only if it were the last remaining option, since it would be too ugly IMHO and I might miss some cases (which would lead to incorrect parsing).

If your input format is as simple as you specify, you can probably get away with a simple regular expression:

^Seat ([0-9]+): (.*) \(([0-9]+) in chips\)$

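The regex engine's (NFA-style, backtracking) matching resolves the ambiguity in this case: the line has to start with "Seat <number>: " and end with " (<number> in chips)", so whatever lies between those two fixed parts is taken as the nickname, even if it contains spaces or parentheses. The parentheses in the pattern are capture groups, so you can extract the seat, the nickname and the chip count.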
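A minimal Java sketch of how the pattern above might be used with java.util.regex. The class name SeatLineRegex and the parse helper are made up for illustration; only the pattern itself comes from this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original project.
final class SeatLineRegex {
    // Same pattern as above; backslashes are doubled for a Java string literal.
    private static final Pattern SEAT_LINE =
            Pattern.compile("^Seat ([0-9]+): (.*) \\(([0-9]+) in chips\\)$");

    /** Returns {seat, nickname, chips}, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = SEAT_LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // The nickname here is literally "Seat 5:", one of the awkward cases from the question.
        String[] r = parse("Seat 5: Seat 5: (1500 in chips)");
        System.out.println(String.join(" | ", r)); // prints: 5 | Seat 5: | 1500
    }
}
```

The sketch does not enforce the 4-12 character limit on nicknames; replacing (.*) with (.{4,12}) would be one way to add that, if the limit really applies to every line.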

You have two solutions:

  • Add some restrictions to the names. I can hardly remember any widely used system that would accept such nicknames. Just allow alphanumeric characters and "_" as a separator. You can also reserve keywords such as "Seat", so that such a word cannot be used as a nickname.
  • Alternatively, you can build a finite automaton for parsing, based on your grammar. I think an FSM can handle this kind of ambiguous grammar. Once you have it, you can parse everything you want.

Anyway, I think there is a problem with the original design. Nicknames should not be allowed to contain such a wide set of characters. Also, why can't you use identifiers instead of names? The names can be stored in a database.
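If the nickname rules cannot be changed, one possible reading of the automaton suggestion is a small hand-written scanner that anchors on the fixed text at both ends of the line. The sketch below is only an illustration under that assumption; the class name SeatLineScanner and its methods are hypothetical. It relies on the seat number and the chip count being digits only, so the first ": " and the last " (" cannot belong to them.

```java
// Hypothetical hand-written scanner, not part of the original project.
final class SeatLineScanner {
    private static final String PREFIX = "Seat ";
    private static final String SUFFIX = " in chips)";

    /** Returns {seat, nickname, chips}, or null if the line is not a seat line. */
    static String[] parse(String line) {
        if (!line.startsWith(PREFIX) || !line.endsWith(SUFFIX)) {
            return null;
        }
        int suffixStart = line.length() - SUFFIX.length();
        // Left anchor: the first ": " after "Seat " must end the seat number.
        int colon = line.indexOf(": ", PREFIX.length());
        // Right anchor: the last " (" before the suffix must open the chip count.
        int paren = line.lastIndexOf(" (", suffixStart);
        if (colon < 0 || paren < colon + 2) {
            return null; // no room left for a nickname between the anchors
        }
        String seat = line.substring(PREFIX.length(), colon);
        String nickname = line.substring(colon + 2, paren);
        String chips = line.substring(paren + 2, suffixStart);
        if (!isDigits(seat) || !isDigits(chips)) {
            return null;
        }
        return new String[] { seat, nickname, chips };
    }

    private static boolean isDigits(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] r = parse("Seat 5: myNickname (1500 in chips)");
        System.out.println(String.join(" | ", r)); // prints: 5 | myNickname | 1500
    }
}
```

Everything between the two anchors is treated as the nickname, whatever characters it contains. The 4-12 character limit from the question is not checked here and would be a one-line addition.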

A grammar for your system could look like this (written as a context-free grammar):

S -> seating nickname chips

seating -> "Seat " number ":"
number -> "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
number -> number number

nickname -> "a" | "b" | "c" | ... | "z" | ... | "+" | "?" | number
nickname -> nickname nickname 

chips -> "(" number "in chips)"

Notice the rule of the form:

number -> number number

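This recursive rule lets a number be arbitrarily long: any sequence of one or more digits can be derived from it. Note that this does not mean the grammar accepts everything. Together with the single-digit rule, the line above is basically the equivalent of the regex \d+ (one or more digits).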

I find that writing the grammar down as a CFG and then converting it into a regular grammar helps me most of the time. More on how to do that here. Good luck!

