UTF-8 to UTF-32 on iterators using the STL

Question

I have a char iterator - a std::istreambuf_iterator<char> wrapped in a couple of adaptors - yielding UTF-8 bytes. I want to read a single UTF-32 character (a char32_t) from it. Can I do so using the STL? How?

There's std::codecvt_utf8<char32_t>, but that apparently only works on char*, not arbitrary iterators.

Here's a simplified version of my code:

#include <iostream>
#include <sstream>
#include <iterator>

// in the real code some boost adaptors etc. are involved
// but the important point is: we're dealing with a char iterator.
typedef std::istreambuf_iterator< char > iterator;

char32_t read_code_point( iterator& it, const iterator& end )
{
    // how do I do this conversion?
    // codecvt_utf8<char32_t>::in() only works on char*
    return U'\0';
}

int main()
{
    // actual code uses std::istream so it works on strings, files etc.
    // but that's irrelevant for the question
    std::stringstream stream( u8"\u00FF" );
    iterator it( stream );
    iterator end;
    char32_t c = read_code_point( it, end );
    std::cout << std::boolalpha << ( c == U'\u00FF' ) << std::endl;
    return 0;
}

I am aware that Boost.Regex has an iterator for this, but I'd like to avoid boost libraries that are not header-only and this feels like something the STL should be capable of.

Answer 1

I don't think you can do this directly with codecvt_utf8 or any other standard library components. To use codecvt_utf8 you'd need to copy bytes from the iterator stream into a buffer and convert the buffer.

Something like this should work:

#include <codecvt>    // std::codecvt_utf8
#include <cwchar>     // std::mbstate_t
#include <iterator>   // std::end
#include <locale>     // std::codecvt_base
#include <stdexcept>  // std::runtime_error

char32_t read_code_point( iterator& it, const iterator& end )
{
  char32_t result = U'\0';
  char32_t* resend = &result + 1;
  char32_t* resnext = &result;
  char buf[7];  // room for 3-byte UTF-8 BOM and a 4-byte UTF-8 character
  char* bufpos = buf;
  const char* const bufend = std::end(buf);
  std::codecvt_utf8<char32_t> cvt;
  while (bufpos != bufend && it != end)
  {
    // append one more byte, then retry the conversion on the whole buffer
    *bufpos++ = *it++;
    std::mbstate_t st{};  // fresh state: every attempt re-scans from buf
    const char* be = bufpos;
    const char* bn = buf;
    auto conv = cvt.in(st, buf, be, bn, &result, resend, resnext);
    if (conv == std::codecvt_base::error)
      throw std::runtime_error("Invalid UTF-8 sequence");
    // done once every buffered byte has been consumed
    if (conv == std::codecvt_base::ok && bn == be)
      return result;
    // otherwise read another byte and try again
  }
  if (it == end)
    throw std::runtime_error("Incomplete UTF-8 sequence");
  throw std::runtime_error("No character read from first seven bytes");
}

This appears to do more work than necessary, re-scanning the whole UTF-8 sequence in [buf, bufpos) on every iteration (and making a virtual function call to codecvt_utf8::do_in). In theory the codecvt_utf8::in implementation could read an incomplete multibyte sequence and store state information in the mbstate_t argument, so that the next call would resume where the last one left off, consuming only new bytes rather than re-processing the incomplete multibyte sequence it had already seen.

However, implementations are not required to use the mbstate_t argument to store state between calls, and in practice at least one implementation of codecvt_utf8::in (the one I wrote for GCC) doesn't use it at all. From my experiments, the libc++ implementation doesn't seem to use it either. Such implementations stop converting before an incomplete multibyte sequence and leave the from_next pointer (the bn argument here) pointing to the beginning of that sequence, so that the next call can start from that position and (hopefully) supply enough additional bytes to complete the sequence and convert a full Unicode character to char32_t. Because you are only trying to read a single code point, stopping before an incomplete multibyte sequence means stopping at the first byte, so no conversion happens at all until the whole sequence is in the buffer.
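The behaviour described above can be checked with a small probe. This is only a sketch: probe_result and probe_incomplete_sequence are hypothetical names, and the outcome depends on the standard library in use. It offers in() just the lead byte of a two-byte sequence and records how much input was consumed and how much output was produced.

```cpp
#include <codecvt>
#include <cstddef>
#include <cwchar>
#include <locale>

// Hypothetical helper type: what in() did with an incomplete sequence.
struct probe_result
{
    std::codecvt_base::result status;  // ok, partial, error or noconv
    std::ptrdiff_t consumed;           // input bytes consumed
    std::ptrdiff_t produced;           // char32_t values written
};

// Offers only the first byte of "\xC3\xBF" (the UTF-8 encoding of U+00FF).
inline probe_result probe_incomplete_sequence()
{
    const char buf[] = { '\xC3' };
    char32_t out = U'\0';
    char32_t* out_next = &out;
    const char* in_next = buf;
    std::mbstate_t st{};
    std::codecvt_utf8<char32_t> cvt;
    auto status = cvt.in(st, buf, buf + 1, in_next, &out, &out + 1, out_next);
    return { status, in_next - buf, out_next - &out };
}
```

On an implementation that ignores the mbstate_t, as described for GCC and libc++ above, one would expect consumed and produced to both be 0. An implementation that stores partial sequences in the state could report consumed as 1 instead.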

It's possible that some implementations do use the mbstate_t argument, so you could modify the function above to handle that case as well, but to be portable it would still need to cope with implementations that ignore the mbstate_t. Supporting both types of implementation would complicate the function considerably, so I kept it simple and wrote a form that should work with all implementations, even those that do use the mbstate_t. Because you only ever read up to 7 bytes at a time (and that is the worst case; the average may be only one or two bytes, depending on the input text), the cost of re-scanning the first few bytes every time shouldn't be huge.

To get better performance from codecvt_utf8 you should avoid converting one code point at a time, because it's designed for converting arrays of characters, not individual ones. Since you always need to copy to a char buffer anyway, you could copy larger chunks from the input iterator sequence and convert whole chunks. This would reduce the cost of incomplete multibyte sequences: if a chunk ends in an incomplete sequence, only the last 1-3 bytes of the chunk need to be re-processed, because everything earlier in the chunk has already been converted.

To get better performance reading single codepoints you should probably avoid codecvt_utf8 entirely and either roll your own (if you only need UTF-8 to UTF-32BE it's not so hard) or use a third-party library such as ICU.

solution1
3 ACCEPTED 2015-07-07 14:04:45