
How to decode UTF-8 Unicode characters with Indy

I have a TIdHttpServer application. I have a simple HTML document with special characters:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">


    <head>
        <meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
        <title>This is the title</title>
    </head>

    <body>
        <form method="post">
            <p>
                <input name="name" value="Все данные по веб-сайту" />
                <input type="submit" value="submit" />
            </p>
        </form>
    </body>
</html>

I serve this page and process the POST. My "Get" code is below. The problem is that I am unable to decode the %hh data properly.

procedure TForm3.Get(AContext: TIdContext;
  ARequestInfo: TIdHTTPRequestInfo; AResponseInfo: TIdHTTPResponseInfo);
var
  mFileName: String;
  txtFile: TextFile;
begin
  if ARequestInfo.Params.values['name']<>'' then begin
    AssignFile( txtFile , ChangeFileExt(ParamStr(0),'.log') );
    Append( TxtFile );
    WriteLn(TxtFile,'Unparsed:'+ARequestInfo.UnparsedParams);
    WriteLn(TxtFile,'Parsed:'+ARequestInfo.Params.values['name']);
    MyDecodeAndSetParams(ARequestInfo);
    WriteLn(TxtFile,'Decoded:'+ARequestInfo.Params.values['name']);
    System.Close( TxtFile );
  end ;
  mFileName := ExtractFileDir(ParamStr(0))+'\inputform.txt';
  AResponseInfo.ContentStream := TFileStream.Create(mFileName, fmOpenRead);

end;

The MyDecodeAndSetParams function:

procedure MyDecodeAndSetParams(ARequestInfo: TIdHTTPRequestInfo);
var
  i, j : Integer;
  value,s: string;
  LEncoding: IIdTextEncoding;
begin
  if IsHeaderMediaType(ARequestInfo.ContentType, 'application/x-www-form-urlencoded') then
  begin
    value := ARequestInfo.FormParams;
//    LEncoding := CharsetToEncoding(ARequestInfo.CharSet);
    if ARequestInfo.CharSet <> '' then
      LEncoding := CharsetToEncoding(ARequestInfo.CharSet)
    else
      LEncoding := IndyTextEncoding_UTF8;
  end else
  begin
    value := ARequestInfo.QueryParams;
    LEncoding := IndyTextEncoding_UTF8;
  end;

  ARequestInfo.Params.BeginUpdate;
  try
    ARequestInfo.Params.Clear;
    i := 1;
    while i <= Length(value) do
    begin
      j := i;
      while (j <= Length(value)) and (value[j] <> '&') do
      begin
        Inc(j);
      end;
      s := StringReplace(Copy(value, i, j-i), '+', ' ', [rfReplaceAll]);
      ARequestInfo.Params.Add(TIdURI.URLDecode(s, LEncoding));
      i := j + 1;
    end;
  finally
    ARequestInfo.Params.EndUpdate;
  end;
end;

The output in my file is as follows:

Unparsed:name=%D0%92%D1%81%D0%B5+%D0%B4%D0%B0%D0%BD%D0%BD%D1%8B%D0%B5+%D0%BF%D0%BE+%D0%B2%D0%B5%D0%B1-%D1%81%D0%B0%D0%B9%D1%82%D1%83
Parsed:οсе даннϿе по веб-сайϿϿ
Decoded:οсе даннϿе по веб-сайϿϿ

I can take the Unparsed data and decode it using this decoder and it returns the string properly:

Все данные по веб-сайту

What do I need to do so that I can properly decode the params to what they were on the form?

If ARequestInfo.CharSet is blank (because the client did not send a charset in the HTTP Content-Type header), CharsetToEncoding('') will return Indy's native 8-bit encoding rather than UTF-8. That is why your data is not being decoded properly.
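
To see the effect in isolation, here is a minimal console sketch. It assumes a Unicode version of Delphi (2009 or later) and a recent Indy 10 release in which IndyTextEncoding_8Bit is available in IdGlobal. The program name and the Encoded constant are made up for this illustration; the constant is just the first word of the posted value.

```pascal
program DecodeCompare;

{$APPTYPE CONSOLE}

uses
  IdGlobal, IdURI;

const
  // First word of the posted value: three Cyrillic letters, six UTF-8 bytes.
  Encoded = '%D0%92%D1%81%D0%B5';
var
  AsUtf8, As8Bit: string;
begin
  // The six bytes are interpreted as UTF-8, giving back three characters.
  AsUtf8 := TIdURI.URLDecode(Encoded, IndyTextEncoding_UTF8);
  // Each byte is mapped to one character, so the Cyrillic text is lost.
  As8Bit := TIdURI.URLDecode(Encoded, IndyTextEncoding_8Bit);

  // Lengths are printed instead of the text itself, because console output
  // depends on the active code page.
  WriteLn('UTF-8 length: ', Length(AsUtf8));
  WriteLn('8-bit length: ', Length(As8Bit));
end.
```

Under those assumptions the UTF-8 result should be 3 characters long and the 8-bit result 6, one character per byte.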

For application/x-www-form-urlencoded, a charset is not always sent in the HTTP headers, as the client may assume the server knows which charset to expect based on the charset it sent the HTML in. It is also possible that the client sends the charset in the posted form data instead, such as in a _charset_ field.

Try changing this:

LEncoding := CharsetToEncoding(ARequestInfo.CharSet);

To this:

if ARequestInfo.CharSet <> '' then
  LEncoding := CharsetToEncoding(ARequestInfo.CharSet)
else
  LEncoding := IndyTextEncoding_UTF8;

This way, you default to UTF-8 unless the client sends an explicit charset.
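
If you also want to honor a _charset_ form field, the same fallback can be extended. The following is only a sketch: the unit name and the PickFormEncoding helper are hypothetical, and it assumes the form actually contains a hidden input named _charset_ (browsers only fill that field in when it is present). It has to be called before Params is cleared, since it reads the field from Params.

```pascal
unit FormEncodingSketch;

interface

uses
  IdGlobal, IdCustomHTTPServer;

// Hypothetical helper: chooses the encoding used to decode posted form data.
function PickFormEncoding(ARequestInfo: TIdHTTPRequestInfo): IIdTextEncoding;

implementation

uses
  IdGlobalProtocols;

function PickFormEncoding(ARequestInfo: TIdHTTPRequestInfo): IIdTextEncoding;
var
  LCharset: string;
begin
  // 1. Prefer the charset from the HTTP Content-Type header, if any.
  LCharset := ARequestInfo.CharSet;

  // 2. Otherwise look for a _charset_ field in the posted data. Its value is
  //    plain ASCII, so it survives even if Params was decoded incorrectly.
  if LCharset = '' then
    LCharset := ARequestInfo.Params.Values['_charset_'];

  // 3. Fall back to UTF-8 when the client gave no hint at all.
  //    Note: the charset name is not validated in this sketch.
  if LCharset <> '' then
    Result := CharsetToEncoding(LCharset)
  else
    Result := IndyTextEncoding_UTF8;
end;

end.
```

In MyDecodeAndSetParams, the form branch could then assign LEncoding := PickFormEncoding(ARequestInfo) in place of the if/else.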


Update: If you are using a pre-Unicode version of Delphi (2007 or earlier), Indy uses AnsiString instead of UnicodeString, so TIdURI.URLDecode() will first decode the input to Unicode using the specified AByteEncoding parameter (defaulting to IndyTextEncoding_UTF8 if none is specified), and will then convert the Unicode data to ANSI using the specified ADestEncoding parameter (defaulting to IndyTextEncoding_OSDefault if none is specified).

The Russian input you have shown decodes properly to Unicode when decoded as UTF-8, but it can easily lose characters (turning them into '?') during the conversion to ANSI if your code is running on a machine whose OS-level charset is not a Russian one (such as ISO-8859-5 or KOI8-R).

To ensure a correct conversion, you would have to specify the desired AnsiString encoding on those machines, e.g.:

var
  LEncoding, LAnsiEncoding: IIdTextEncoding;
...

LEncoding := IndyTextEncoding_UTF8;
LAnsiEncoding := CharsetToEncoding('ISO-8859-5'); // or 'KOI8-R', etc
...
ARequestInfo.Params.Add(TIdURI.URLDecode(s, LEncoding, LAnsiEncoding));

In Unicode versions of Delphi (2009 and later), Indy uses UnicodeString instead of AnsiString, so there is no ADestEncoding parameter present.
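
If the same source has to compile on both kinds of compiler, the two call forms can be wrapped in one place. This is a minimal sketch, assuming Delphi (where the compiler defines UNICODE from 2009 onward); the unit name and the DecodeParam helper are hypothetical, and ISO-8859-5 is only the example charset used above.

```pascal
unit DecodeParamSketch;

interface

// Hypothetical helper: decodes one percent-encoded name=value pair as UTF-8.
function DecodeParam(const S: string): string;

implementation

uses
  IdGlobal, IdGlobalProtocols, IdURI;

function DecodeParam(const S: string): string;
begin
  {$IFDEF UNICODE}
  // Unicode Delphi: the result is a UnicodeString, no ANSI conversion occurs.
  Result := TIdURI.URLDecode(S, IndyTextEncoding_UTF8);
  {$ELSE}
  // Pre-Unicode Delphi: name the target ANSI charset explicitly instead of
  // relying on the OS default.
  Result := TIdURI.URLDecode(S, IndyTextEncoding_UTF8,
    CharsetToEncoding('ISO-8859-5'));
  {$ENDIF}
end;

end.
```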

