I have a char array containing some UTF-8-encoded Turkish characters in the form of escaped octets. If I run this code in C++11:
void foo(char* utf8_encoded) {
    cout << utf8_encoded << endl;
}
it prints \xc4\xb0-\xc3\x87-\xc3\x9c-\xc4\x9e. I want to convert this char[] to a std::string so that it contains the UTF-8 decoded values İ-Ç-Ü-Ğ. I have converted that char[] to a wstring, but it still prints as \xc4\xb0-\xc3\x87-\xc3\x9c-\xc4\x9e. How can I do that?
EDIT: I'm not the one who constructs this char[]. It is one of the static-length parameters of a callback function called by a private library. So the callback function is as follows:
void some_callback_function (INFO *info) {
    cout << info->some_char_array << endl;
    cout << "*****" << endl;
    for(int i=0; i<64; i++) {
        cout << "-" << info->some_char_array[i];
    }
    cout << "*****" << endl;
    char bar[65] = "\xc4\xb0-\xc3\x87-\xc3\x9c-\xc4\x9e";
    cout << bar << endl;
}
Where the INFO struct is:
typedef struct {
    char some_char_array[65];
} INFO;
So when my callback function is called, the output is as follows:
\xc4\xb0-\xc3\x87-\xc3\x9c-\xc4\x9e
*****
-\-x-c-4-\-x-b-0---\-x-c-3-\-x-8-7---\-x-c-3-\-x-9-c---\-x-c-4-\-x-9-e-----------------------------
*****
İ-Ç-Ü-Ğ
So my current question is: I don't understand the difference between the info->some_char_array and bar char arrays. What I want is to edit info->some_char_array so that it prints İ-Ç-Ü-Ğ.
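A minimal sketch of why the two arrays print differently, assuming the library really delivers the backslash, the x, and the hex digits as separate characters (which the dash-separated dump above suggests). In a string literal the compiler turns each \xNN escape into a single byte; data that arrives at run time gets no such translation. The names from_literal, from_library, and visible_length are made up for this illustration.

```cpp
#include <cstddef>
#include <cstring>

// In a literal, the compiler converts each \xNN escape into one byte,
// so this array holds the two UTF-8 bytes of U+0130 plus the terminator.
const char from_literal[] = "\xc4\xb0";

// With the backslashes escaped, the array holds eight ordinary characters
// ('\', 'x', 'c', '4', ...), like the data the callback receives.
const char from_library[] = "\\xc4\\xb0";

static_assert( sizeof( from_literal ) == 3, "2 bytes + terminator" );
static_assert( sizeof( from_library ) == 9, "8 characters + terminator" );

// Run-time view of the same facts (hypothetical helper).
std::size_t visible_length( const char * s ) { return std::strlen( s ); }
```

So bar holds already-decoded bytes, while info->some_char_array holds the text of the escapes and still needs to be decoded at run time.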
OK, this is a bit of a handful, ripped out of a larger parser I am using. But "a bit of a handful" is the nature of Boost.Spirit. ;-)
The parser will not only parse hexadecimal escapes, but octals (\123) and "standard" escapes (\n) as well. Provided under CC0, so you can do with it whatever you like. ;-)
Boost.Spirit is a "header only" part of Boost, so you don't need to link in any library code. The rather involved "magic" done by the Spirit headers to allow grammars to be expressed in C++ source this way is a bit hard on the compile time, though.
But it works, and works well.
#define BOOST_SPIRIT_USE_PHOENIX_V3
#include "boost/spirit/include/qi.hpp"
#include "boost/spirit/include/phoenix.hpp"
#include <iostream>
#include <string>
#include <cstring>
#include <sstream>
#include <stdexcept>
namespace
{
    // Helper function: Turn on_error positional parameters into error message.
    template< typename Iterator >
    std::string make_error_message( boost::spirit::info const & info, Iterator first, Iterator last )
    {
        std::ostringstream oss;
        oss << "Invalid sequence. Expecting " << info << " here: \"" << std::string( first, last ) << "\"";
        return oss.str();
    }
}
// Wrap helper function with Boost.Phoenix boilerplate, so the function
// can be called from within a parser's [].
BOOST_PHOENIX_ADAPT_FUNCTION( std::string, make_error_message_, make_error_message, 3 )
// Supports various escape sequences:
// - Character escapes ( \a \b \f \n \r \t \v \" \\ )
// - Octal escapes ( \n \nn \nnn )
// - Hexadecimal escapes ( \xnn ) (*)
//
// (*): In C/C++, a hexadecimal escape runs until the first non-hexdigit
// is encountered, which is not very helpful. This one takes exactly
// two hexdigits.
// Declaring a grammar that works given any kind of iterator,
// and results in a std::string object.
template < typename Iterator >
class EscapedString : public boost::spirit::qi::grammar< Iterator, std::string() >
{
    public:
        // Constructor
        EscapedString() : EscapedString::base_type( escaped_string )
        {
            // An escaped string is a sequence of
            // characters that are not '\', or
            // an escape sequence
            escaped_string = *( +( boost::spirit::ascii::char_ - '\\' ) | escapes );
            // An escape sequence begins with '\', followed by
            // an escaped character (e.g. "\n"), or
            // an 'x' and 2..2 hexadecimal digits, or
            // 1..3 octal digits.
            escapes = '\\' > ( escaped_character
                             | ( "x" > boost::spirit::qi::uint_parser< char, 16, 2, 2 >() )
                             | boost::spirit::qi::uint_parser< char, 8, 1, 3 >() );
            // The list of special "escape" characters
            escaped_character.add
                ( "a", 0x07 )  // alert
                ( "b", 0x08 )  // backspace
                ( "f", 0x0c )  // form feed
                ( "n", 0x0a )  // new line
                ( "r", 0x0d )  // carriage return
                ( "t", 0x09 )  // horizontal tab
                ( "v", 0x0b )  // vertical tab
                ( "\"", 0x22 ) // literal quotation mark
                ( "\\", 0x5c ) // literal backslash
            ;
            // Error handling
            boost::spirit::qi::on_error< boost::spirit::qi::fail >
            (
                escapes,
                // backslash not followed by a valid sequence
                boost::phoenix::throw_(
                    boost::phoenix::construct< std::runtime_error >( make_error_message_( boost::spirit::_4, boost::spirit::_3, boost::spirit::_2 ) )
                )
            );
        }
    private:
        // Qi Rule member
        boost::spirit::qi::rule< Iterator, std::string() > escaped_string;
        // Helpers
        boost::spirit::qi::rule< Iterator, std::string() > escapes;
        boost::spirit::qi::symbols< char const, char > escaped_character;
};
int main()
{
    // Need to escape the backslashes, or "\xc4" would give *one*
    // byte of output (0xc4, decimal 196). I understood the input
    // to be the FOUR character hex char literal,
    // backslash, x, c, 4 in this case,
    // which is what this string literal does.
    // Pointer to const, because a string literal does not convert
    // to plain "char *" in C++11.
    char const * some_char_array = "\\xc4\\xb0-\\xc3\\x87-\\xc3\\x9c-\\xc4\\x9e";
    std::cout << "Input: '" << some_char_array << "'\n";
    // result object
    std::string s;
    // Create an instance of the grammar with "char const *"
    // as the iterator type.
    EscapedString< char const * > es;
    // start, end, parsing grammar, result object
    boost::spirit::qi::parse( some_char_array,
                              some_char_array + std::strlen( some_char_array ),
                              es,
                              s );
    std::cout << "Output: '" << s << "'\n";
    return 0;
}
This gives:
Input: '\xc4\xb0-\xc3\x87-\xc3\x9c-\xc4\x9e'
Output: 'İ-Ç-Ü-Ğ'
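If pulling in Boost.Spirit is not an option, the hexadecimal part alone can be sketched with the standard library. This is not the parser above, just a rough alternative under stated assumptions: the function names decode_hex_escapes and hex_value are made up, only \xNN with exactly two hex digits is handled, and anything else (including octal and "standard" escapes) is copied through unchanged instead of raising an error.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Value of one hexadecimal digit; the caller has already checked std::isxdigit.
static int hex_value( char c )
{
    if ( c >= '0' && c <= '9' ) return c - '0';
    if ( c >= 'a' && c <= 'f' ) return c - 'a' + 10;
    return c - 'A' + 10;
}

// Hypothetical helper: replaces every "\xNN" (exactly two hex digits)
// with the single byte 0xNN and copies all other characters unchanged.
std::string decode_hex_escapes( std::string const & in )
{
    std::string out;
    for ( std::size_t i = 0; i < in.size(); ++i )
    {
        if ( in[i] == '\\' && i + 3 < in.size() && in[i + 1] == 'x'
             && std::isxdigit( static_cast< unsigned char >( in[i + 2] ) )
             && std::isxdigit( static_cast< unsigned char >( in[i + 3] ) ) )
        {
            out += static_cast< char >( hex_value( in[i + 2] ) * 16 + hex_value( in[i + 3] ) );
            i += 3; // skip 'x' and both digits; the loop increment skips the backslash
        }
        else
        {
            out += in[i];
        }
    }
    return out;
}
```

In the callback from the question this could be called as decode_hex_escapes( info->some_char_array ), assuming the array is NUL-terminated. The decoded string is never longer than the input, so it would also fit back into the same 65-byte buffer.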