Below is my grammar file.
grammar My;
tokens {
DELIMITER
}
string:SINGLE_QUOTED_TEXT;
SINGLE_QUOTED_TEXT: (
'\'' (.)*? '\''
)+
;
I'm trying to use this to accept any string (it's actually part of MySQL's g4 grammar). I then test it with this code:
#include "MyLexer.h"
#include "MyParser.h"
#include <string>
using namespace My;
int main()
{
    std::string s = "'中'";
    antlr4::ANTLRInputStream input(s);
    MyLexer lexer(&input);
    antlr4::CommonTokenStream tokens(&lexer);
    MyParser parser(&tokens);
    parser.string();
    return 0;
}
The Chinese character 中 is encoded in UTF-8 as 3 bytes: \xe4 \xb8 \xad
Both the grammar file and the source file are encoded in UTF-8. What can I do to make this work?
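The byte sequence mentioned above can be checked with a small sketch like the following. The helper name hex_bytes is made up for illustration and is not part of ANTLR; it only dumps the raw bytes of a std::string as hex.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of ANTLR): render each byte of `s` as two
// lowercase hex digits, separated by single spaces.
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not printed as
        // negative values.
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned char>(s[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling hex_bytes on the test input shows the two quote bytes (0x27) around the three bytes of 中, which is what the lexer receives if the input is handled byte by byte.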
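I've figured out the problem.
See https://stackoverflow.com/a/26865200/9634413 for reference.
The ANTLR C++ runtime uses a std::u32string to store the input. Where char is signed, the byte \xe4 is sign-extended when it is cast to a 32-bit character, giving \xffffffe4, which is outside the Unicode range [0, 0x10ffff].
To fix this problem, derive from ANTLRInputStream and convert each byte through unsigned char in the constructor, like this: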
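The sign extension can be reproduced without ANTLR. The sketch below is only an illustration with made-up helper names (widen_naive, widen_safe, widen_bytes); it assumes a platform where char32_t is 32 bits wide, and the "naive" result depends on whether plain char is signed on your compiler.

```cpp
#include <string>

// Widen a byte the way a direct char -> char32_t conversion does.
// Where plain char is signed, '\xe4' is -28 and the result is 0xFFFFFFE4.
// Where plain char is unsigned, the result is simply 0xE4.
char32_t widen_naive(char c) {
    return static_cast<char32_t>(c);
}

// Go through unsigned char first so the value always stays in [0, 0xFF].
char32_t widen_safe(char c) {
    return static_cast<char32_t>(static_cast<unsigned char>(c));
}

// Hypothetical helper: widen every byte of a string safely, producing one
// 32-bit element per input byte (this does NOT decode UTF-8).
std::u32string widen_bytes(const std::string& s) {
    std::u32string out;
    out.reserve(s.size());
    for (char c : s) {
        out.push_back(widen_safe(c));
    }
    return out;
}
```

With the safe variant, the three bytes of 中 become the three values 0xE4, 0xB8 and 0xAD, all of which are inside the valid range.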
#include <algorithm>
#include <istream>
#include <string>
#include "antlr4-runtime.h"

class MyStream : public antlr4::ANTLRInputStream {
public:
    MyStream(const std::string& input = "")
        : antlr4::ANTLRInputStream(input)
    {
        // Skip the UTF-8 BOM if present
        const char bom[4] = "\xef\xbb\xbf";
        const size_t skip = (input.compare(0, 3, bom, 3) == 0) ? 3 : 0;
        // Make sure the buffer holds exactly one element per remaining byte
        // (assumes _data is the std::u32string described above).
        _data.resize(input.size() - skip);
        // Copy each byte through unsigned char so values >= 0x80 are not
        // sign-extended.
        std::transform(input.begin() + skip, input.end(), _data.begin(),
            [](char c) -> unsigned char { return c; });
        p = 0;
    }
    // The remaining constructors only forward to the base class.
    MyStream(const char data_[], size_t numberOfActualCharsInArray)
        : antlr4::ANTLRInputStream(data_, numberOfActualCharsInArray)
    {
    }
    MyStream(std::istream& stream)
        : antlr4::ANTLRInputStream(stream)
    {
    }
};