How to use ANTLR4 to parse a 3-byte UTF-8 string

Question

Below is my grammar file.

grammar My;

tokens {
    DELIMITER
}

string:SINGLE_QUOTED_TEXT;

SINGLE_QUOTED_TEXT: (
        '\'' (.)*? '\''
    )+
;

I'm trying to use this to accept any string (it's actually part of MySQL's g4 grammar). Then I use this code to test it:

#include "MyLexer.h"
#include "MyParser.h"
#include <string>
using namespace My;

int main()
{
    std::string s = "'中'";

    antlr4::ANTLRInputStream input(s);
    MyLexer lexer(&input);

    antlr4::CommonTokenStream tokens(&lexer);
    MyParser parser(&tokens);

    parser.string();

    return 0;
}

Result is

The Chinese character 中 is encoded in UTF-8 as 3 bytes: \xe4 \xb8 \xad

Both the grammar file and the source file are encoded in UTF-8. What can I do to make this work?

Answer 1

I've figured out the problem.

See https://stackoverflow.com/a/26865200/9634413

The ANTLR C++ runtime stores its input in a std::u32string. When char is signed, the byte \xe4 is sign-extended during the cast and becomes \xffffffe4, which is outside the Unicode range [0, 0x10ffff].
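The sign extension can be reproduced without ANTLR. This is only a minimal sketch: it uses signed char explicitly to model a platform where plain char is signed, and the helper names widen_signed and widen_unsigned are made up for illustration.

```cpp
#include <cstdint>

// Widen a byte the way a signed char is converted to a 32-bit code unit:
// negative values are sign-extended. `signed char` is used explicitly so the
// result does not depend on whether plain `char` is signed on the platform.
inline std::uint32_t widen_signed(signed char c) {
    return static_cast<std::uint32_t>(static_cast<char32_t>(c));
}

// Widen a byte after converting it to unsigned char first:
// the value stays in the range 0..255.
inline std::uint32_t widen_unsigned(signed char c) {
    return static_cast<std::uint32_t>(static_cast<unsigned char>(c));
}
```

For the byte \xe4 (which is -28 as a signed char), widen_signed gives 0xffffffe4 while widen_unsigned gives 0xe4.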

To fix this problem, derive a class from ANTLRInputStream whose constructor converts each byte through unsigned char, like:

#include <algorithm>
#include <istream>
#include <string>
#include "antlr4-runtime.h"

class MyStream : public antlr4::ANTLRInputStream {
public:
    MyStream(const std::string& input = "")
        : antlr4::ANTLRInputStream(input)
    {
        // Skip the UTF-8 BOM if present
        const char bom[4] = "\xef\xbb\xbf";
        const size_t offset = (input.compare(0, 3, bom, 3) == 0) ? 3 : 0;
        // Give _data exactly one slot per remaining byte, so the copy below
        // cannot write past the end of the buffer set up by the base class
        _data.resize(input.size() - offset);
        // Go through unsigned char so bytes >= 0x80 are not sign-extended
        std::transform(input.begin() + offset, input.end(), _data.begin(),
            [](char c) -> unsigned char { return c; });
        p = 0;
    }
    // The remaining constructors just forward to the base class
    MyStream(const char data_[], size_t numberOfActualCharsInArray)
        : antlr4::ANTLRInputStream(data_, numberOfActualCharsInArray)
    {
    }
    MyStream(std::istream& stream)
        : antlr4::ANTLRInputStream(stream)
    {
    }
};
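
The byte-wise conversion done in the constructor can also be tried on its own, without the ANTLR runtime. The following is a sketch only; widen_bytes is a hypothetical helper, not part of ANTLR, and it mirrors the logic above under the assumption that the input buffer is a std::u32string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: turn each byte of `input` into one char32_t in the
// range 0..255, skipping a leading UTF-8 BOM if there is one.
inline std::u32string widen_bytes(const std::string& input) {
    const std::string bom = "\xef\xbb\xbf";
    const std::size_t offset = (input.compare(0, 3, bom) == 0) ? 3 : 0;
    std::u32string out;
    out.reserve(input.size() - offset);
    for (std::size_t i = offset; i < input.size(); ++i) {
        // unsigned char keeps bytes >= 0x80 from being sign-extended
        out.push_back(static_cast<unsigned char>(input[i]));
    }
    return out;
}
```

With this approach the input '中' becomes five code points (0x27, 0xe4, 0xb8, 0xad, 0x27), so the lexer sees the three UTF-8 bytes as three separate characters instead of one decoded character.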
