
Parse HTML escape sequences with Boost Spirit

I am trying to parse text that contains HTML escape sequences and want to replace these escapes with their UTF-8 equivalents:

&nbsp; - 0xC2A0 utf8 representation
&shy; - 0xC2AD utf8 representation

I have a grammar to solve this:

template <typename Iterator>
struct HTMLEscape_grammar : qi::grammar<Iterator, std::string()>
{
    HTMLEscape_grammar() :
        HTMLEscape_grammar::base_type(text)
    {
        htmlescapes.add("&nbsp;", 0xC2AD);
        htmlescapes.add("&shy;", 0xC2AD);

        text = +((+(qi::char_ - htmlescapes)) | htmlescapes);
    }

private:
    qi::symbols<char, uint32_t> htmlescapes;
    qi::rule<Iterator, std::string()> text;
};

But when we parse:

std::string l_test = "test&shy;as test simple&shy;test";
HTMLEscape_grammar<std::string::const_iterator> l_gramar;

std::string l_ast;
bool result = qi::parse(l_test.begin(), l_test.end(), l_gramar, l_ast);

We don't get a UTF-8 string: the 0xC2 part of each UTF-8 symbol is simply cut off, and we just get an ASCII string. This parser is a building block of a more powerful system, so UTF-8 output is required.

I don't know how you suppose that exposing a uint32_t will magically output a UNICODE codepoint. Let alone that something will magically perform UTF8 encoding for that.

Now let me get this straight. You desire to have selected HTML entity references replaced by 슭 (HANGUL SYLLABLE SEULG). In UTF-8 that would be 0xEC 0x8A 0xAD.

Just do the encoding yourself (you're composing an output stream of UTF8 code units anyways):

Live On Coliru

#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <iomanip>
#include <string>
#include <vector>

namespace qi = boost::spirit::qi;

template <typename Iterator>
struct HTMLEscape_grammar : qi::grammar<Iterator, std::string()>
{
    HTMLEscape_grammar() :
        HTMLEscape_grammar::base_type(text)
    {
        // Each entity maps to the UTF-8 code units of U+C2AD: 0xEC 0x8A 0xAD.
        htmlescapes.add("&nbsp;", { '\xEC', '\x8A', '\xAD' });
        htmlescapes.add("&shy;",  { '\xEC', '\x8A', '\xAD' });

        text = *(htmlescapes | qi::char_);
    }

private:
    qi::symbols<char, std::vector<char> > htmlescapes;
    qi::rule<Iterator, std::string()> text;
};

int main() {
    std::string const l_test = "test&shy;as test simple&shy;test";
    HTMLEscape_grammar<std::string::const_iterator> l_gramar;

    std::string l_ast;
    bool result = qi::parse(l_test.begin(), l_test.end(), l_gramar, l_ast);

    if (result) {
        std::cout << "Parse success\n";
        for (unsigned char ch : l_ast)
            std::cout << std::setw(2) << std::setfill('0') << std::hex << std::showbase << static_cast<int>(ch) << " ";
    } else
    {
        std::cout << "Parse failure\n";
    }
}

Prints

Parse success
0x74 0x65 0x73 0x74 0xec 0x8a 0xad 0x61 0x73 0x20 0x74 0x65 0x73 0x74 0x20 0x73 0x69 0x6d 0x70 0x6c 0x65 0xec 0x8a 0xad 0x74 0x65 0x73 0x74 
