GSOAP malforms utf-8 in std::string

I have a C++ server using GSOAP. One of its APIs accepts a string.

<message name="concatRequest">
  <part name="a" type="ns:password"/><!-- ns__concat::a -->
  <part name="b" type="xsd:string"/><!-- ns__concat::b -->
</message>

int billon__concat( struct soap *soap, std::string a, std::string b, std::string &result )
{
//    std::cout <<"PACZPAN A:"<<a<<" B:"<<b <<std::endl;
    std::cout <<"PACZPAN B[0..3]: " << (int)b[0] << " " << (int)b[1] << " " << (int)b[2] << " " <<(int)b[3] << std::endl;
    std::cout <<"PACZPAN B[0..3]: " << (char)b[0] << " " << (char)b[1] << " " << (char)b[2] << " " <<(char)b[3] << std::endl;
    result = a + b;
  //  std::cout <<"PACZPAN res:"<<result <<std::endl;
    return SOAP_OK;
}

ns::password is just a string as well.

Now I send a request with argument b = 'PŁOCK' by two different means. In Wireshark it shows up either as 'PŁOCK' or as P&#x141;OCK, so I think both are correct. The GSOAP log also prints:

POST / HTTP/1.1
Accept-Encoding: gzip,deflate
Content-Type: text/xml;charset=UTF-8
SOAPAction: ""
Content-Length: 471
Host: localhost:8080
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.1.1 (java 1.5)

<soapenv:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:urn="urn:calc">
   <soapenv:Header/>
   <soapenv:Body>
      <urn:concat soapenv:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
         <a xsi:type="urn:password">      </a>
         <b xsi:type="xsd:string">PŁOCK</b>
      </urn:concat>
   </soapenv:Body>
</soapenv:Envelope>

When the server receives it, it becomes PAOCK. There are no bad bytes outside of ASCII, just a different letter.

PACZPAN B[0..3]: 80 65 79 67
PACZPAN B[0..3]: P A O C
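
To see exactly which bytes reach the handler, a hex dump of the whole string is easier to read than casting single elements to int (a plain char may be signed, so bytes above 0x7F would print as negative numbers). This is only a minimal sketch; hex_bytes is a hypothetical helper and not part of GSOAP.

```cpp
#include <cstdio>
#include <string>

// Hypothetical debugging helper (not part of GSOAP):
// renders every byte of s as two upper-case hex digits, separated by spaces.
std::string hex_bytes(const std::string &s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {  // unsigned char keeps bytes >= 0x80 positive
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(c));
        if (!out.empty())
            out += ' ';
        out += buf;
    }
    return out;
}
```

If the UTF-8 bytes arrived untouched, 'Ł' would show up as the two bytes C5 81. The output above (80 65 79 67, i.e. 50 41 4F 43 in hex) has a single 41 in that position instead.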

I don't care that std::string does not handle Unicode well. I want it to hold the bytes exactly as they were sent.

I could add a mapping in typemap.dat: xsd__string = | std::wstring, but I don't want to use std::wstring, and it is not UTF-8 anyway.

By default GSOAP does not handle characters outside of the Latin set (see the doc). This can be changed when the soap context is initialized, by passing a flag:

struct soap *soap = soap_new1( SOAP_C_UTFSTRING );
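
With this flag the std::string holds raw UTF-8 bytes, so length() counts bytes rather than characters. As a small illustration, assuming the input is valid UTF-8, a hypothetical helper (utf8_length is not a GSOAP function) can count code points by skipping continuation bytes:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of GSOAP): counts code points in a UTF-8
// string by counting every byte that is not a continuation byte (10xxxxxx).
// Assumes the input is well-formed UTF-8; no validation is done.
std::size_t utf8_length(const std::string &s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For "PŁOCK" encoded as UTF-8, size() is 6 while utf8_length() is 5, because 'Ł' takes two bytes.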

The main problem with GSOAP's UTF-8 handling is that it converts every UTF-8 character to a Latin-1 character and does not care whether the conversion is possible or not.

GSOAP pretends to handle UTF-8 correctly by accepting and producing <?xml ... encoding="UTF-8"?>, but it doesn't. The GSOAP functions soap_pututf8() and soap_getutf8() in stdsoap2.cpp have no provision for character conversions that fail, so everything gets converted. As a result the UTF-8 character 'Ł' silently becomes the completely wrong character 'A'. (This is consistent with the code point U+0141 being cut down to its low byte, 0x41, which is 'A'.) In my opinion this is a nightmare for national language support. I have told the author of GSOAP, but he does not see a problem with this behavior.
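
The arithmetic behind 'Ł' turning into 'A' can be reproduced in a few lines. This is only an illustration of a lossy narrowing versus a strict check, assuming well-formed input; it is not GSOAP's actual code, and all three function names are made up for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes the first code point of a UTF-8 string (1- to 4-byte forms).
// Minimal sketch: assumes well-formed input and returns 0 for an empty string.
std::uint32_t first_code_point(const std::string &s) {
    if (s.empty())
        return 0;
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    std::uint32_t cp;
    std::size_t extra;
    if (b0 < 0x80)                { cp = b0;        extra = 0; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
    else                          { cp = b0 & 0x07; extra = 3; }
    for (std::size_t i = 1; i <= extra && i < s.size(); ++i)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
    return cp;
}

// Lossy narrowing: keeps only the low 8 bits and never reports an error.
char truncate_to_byte(std::uint32_t cp) {
    return static_cast<char>(cp & 0xFF);
}

// Strict alternative: only code points up to U+00FF exist in Latin-1.
bool fits_latin1(std::uint32_t cp) {
    return cp <= 0xFF;
}
```

The bytes C5 81 decode to U+0141; truncate_to_byte(0x141) gives 0x41 ('A'), while fits_latin1(0x141) is false, which is the point at which a strict converter would raise an error instead of returning a wrong letter.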

So in my opinion the best solution for a small interface is to use SOAP_C_UTFSTRING, as you mentioned, and to call iconv() on every string yourself. If iconv() returns (size_t) -1, your GSOAP method can return soap_sender_fault(soap, strerror(errno), NULL); (see man 3 iconv).

Alternatively, a GSOAP plugin or a web server filter that does the charset conversion before the request reaches your GSOAP application would be an option.

This way you can correctly handle any character set you want. The following example handles ISO-8859-16 (chosen because it contains the letter 'Ł'). https://de.wikipedia.org/wiki/ISO_8859

#include <cerrno>       // errno, EILSEQ, EINVAL
#include <iconv.h>
#include <stdexcept>    // std::invalid_argument
#include <string>
#include <string.h>     // strerror
#include <system_error>

class Iconv {
    const std::string m_to, m_from; // remember for error messages
    const iconv_t m_cd; // the encapsulated conversion descriptor
    const size_t m_multiplier; // one UTF-8 character can need up to 4 bytes
public:
    Iconv(const char *to, const char *from, size_t multiplier)
        : m_to(to), m_from(from), m_cd(iconv_open(to, from)), m_multiplier(multiplier) {
        if (m_cd == ((iconv_t) -1))
            throw std::system_error(errno, std::system_category(), m_from + " to " + m_to);
    }
    ~Iconv() { iconv_close(m_cd); }

    std::string operator()(std::string in) const {
        size_t inbytesleft = in.length();
        char *inbuf = &in[0];
        size_t outbytesleft = m_multiplier * inbytesleft + 1;
        std::string out;
        out.resize(outbytesleft);
        char *outbuf = &out[0];
        if (iconv(m_cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft) == ((size_t) -1)) {
            if (errno == EILSEQ || errno == EINVAL)
                throw std::invalid_argument(m_from + " to " + m_to + ": " + strerror(errno));
            else
                throw std::system_error(errno, std::system_category(), m_from + " to " + m_to);
        }
        out.resize(out.length() - outbytesleft);
        return out;
    }
};

int billon__concat(struct soap *soap, std::string a, std::string b, std::string &result) {
    try {
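        // An iconv descriptor must not be used by several threads at once,
        // so each thread gets its own converter.
        static thread_local const Iconv utf8_to_iso885916("ISO-8859-16", "UTF-8", 1);
        a = utf8_to_iso885916(a);
        b = utf8_to_iso885916(b);

        // do your fancy stuff with ISO-8859-16 strings ...
        result = a + b;

        static thread_local const Iconv iso885916_to_utf8("UTF-8", "ISO-8859-16", 4);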
        result = iso885916_to_utf8(result);
        return SOAP_OK;
    } catch (const std::invalid_argument& ex) {
        return soap_sender_fault(soap, ex.what(), NULL);
    } catch (const std::exception& ex) {
        return soap_receiver_fault(soap, ex.what(), NULL);
    }
}
