How to decode UTF-8 Unicode characters using Indy

Question

I have a TIdHttpServer application. I have a simple HTML document with special characters:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">


    <head>
        <meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
        <title>This is the title</title>
    </head>

    <body>
        <form method="post">
            <p>
                <input name="name" value="Все данные по веб-сайту" />
                <input type="submit" value="submit" />
            </p>
        </form>
    </body>
</html>

I serve this page and process the post. My "Get" code is below. The problem is that I am unable to decode the %hh data properly.

procedure TForm3.Get(AContext: TIdContext;
  ARequestInfo: TIdHTTPRequestInfo; AResponseInfo: TIdHTTPResponseInfo);
var
  mFileName: String;
  txtFile: TextFile;
begin
  if ARequestInfo.Params.values['name']<>'' then begin
    AssignFile( txtFile , ChangeFileExt(ParamStr(0),'.log') );
    Append( TxtFile );
    WriteLn(TxtFile,'Unparsed:'+ARequestInfo.UnparsedParams);
    WriteLn(TxtFile,'Parsed:'+ARequestInfo.Params.values['name']);
    MyDecodeAndSetParams(ARequestInfo);
    WriteLn(TxtFile,'Decoded:'+ARequestInfo.Params.values['name']);
    System.Close( TxtFile );
  end ;
  mFileName := ExtractFileDir(ParamStr(0))+'\inputform.txt';
  AResponseInfo.ContentStream := TFileStream.Create(mFileName, fmOpenRead);

end;

The MyDecodeAndSetParams function:

procedure MyDecodeAndSetParams(ARequestInfo: TIdHTTPRequestInfo);
var
  i, j : Integer;
  value,s: string;
  LEncoding: IIdTextEncoding;
begin
  if IsHeaderMediaType(ARequestInfo.ContentType, 'application/x-www-form-urlencoded') then
  begin
    value := ARequestInfo.FormParams;
//    LEncoding := CharsetToEncoding(ARequestInfo.CharSet);
    if ARequestInfo.CharSet <> '' then
      LEncoding := CharsetToEncoding(ARequestInfo.CharSet)
    else
      LEncoding := IndyTextEncoding_UTF8;
  end else
  begin
    value := ARequestInfo.QueryParams;
    LEncoding := IndyTextEncoding_UTF8;
  end;

  ARequestInfo.Params.BeginUpdate;
  try
    ARequestInfo.Params.Clear;
    i := 1;
    while i <= Length(value) do
    begin
      j := i;
      while (j <= Length(value)) and (value[j] <> '&') do
      begin
        Inc(j);
      end;
      s := StringReplace(Copy(value, i, j-i), '+', ' ', [rfReplaceAll]);
      ARequestInfo.Params.Add(TIdURI.URLDecode(s, LEncoding));
      i := j + 1;
    end;
  finally
    ARequestInfo.Params.EndUpdate;
  end;
end;

The output in my file is as follows:

Unparsed:name=%D0%92%D1%81%D0%B5+%D0%B4%D0%B0%D0%BD%D0%BD%D1%8B%D0%B5+%D0%BF%D0%BE+%D0%B2%D0%B5%D0%B1-%D1%81%D0%B0%D0%B9%D1%82%D1%83
Parsed:οсе даннϿе по веб-сайϿϿ
Decoded:οсе даннϿе по веб-сайϿϿ

I can take the Unparsed data and decode it using this decoder and it returns the string properly:

Все данные по веб-сайту

What do I need to do so that I can properly decode the params to what they were on the form?

Answer 1

If ARequestInfo.CharSet is blank (because the client did not send a charset in the HTTP Content-Type header), CharsetToEncoding('') will return Indy's native 8-bit charset rather than UTF-8. That is why your data is not being decoded properly.
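
To make the failure concrete, here is a small language-neutral sketch, written in Python rather than Delphi, so it does not use Indy at all. It percent-decodes the value from the Unparsed log line once as UTF-8 and once as an 8-bit charset. Latin-1 stands in for "an 8-bit charset" here; that choice is an assumption for illustration and not a statement about what Indy uses internally.

```python
from urllib.parse import unquote_plus

# Raw value of the "name" field, copied from the Unparsed log line.
raw = ("%D0%92%D1%81%D0%B5+%D0%B4%D0%B0%D0%BD%D0%BD%D1%8B%D0%B5+"
       "%D0%BF%D0%BE+%D0%B2%D0%B5%D0%B1-%D1%81%D0%B0%D0%B9%D1%82%D1%83")

# Treating the percent-decoded bytes as UTF-8 recovers the original text.
as_utf8 = unquote_plus(raw, encoding="utf-8")

# Treating the same bytes as one character per byte yields mojibake.
# Latin-1 is only a stand-in for an 8-bit charset (assumption).
as_8bit = unquote_plus(raw, encoding="latin-1")

print(as_utf8)
print(len(as_utf8), len(as_8bit))
```

Every Cyrillic letter occupies two bytes in UTF-8, so the 8-bit interpretation produces roughly twice as many characters, none of them the intended ones.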

For application/x-www-form-urlencoded, a charset is not always sent in the HTTP headers, as the client may assume the server knows which charset to expect based on the charset it sent the HTML in. It is also possible that the client sends the charset in the posted form data instead, such as in a _charset_ field.

Try changing this:

LEncoding := CharsetToEncoding(ARequestInfo.CharSet);

To this:

if ARequestInfo.CharSet <> '' then
  LEncoding := CharsetToEncoding(ARequestInfo.CharSet)
else
  LEncoding := IndyTextEncoding_UTF8;

This way, you default to UTF-8 unless the client sends an explicit charset.
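
The same fallback rule can be sketched outside Delphi. The snippet below is Python, and pick_charset is a hypothetical helper (neither an Indy function nor a standard-library one) that returns the charset parameter of a Content-Type header value, or UTF-8 when none is present.

```python
def pick_charset(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or the default.

    Hypothetical helper for illustration only.
    """
    # Everything after the first ';' is a list of name=value parameters.
    for part in content_type.split(";")[1:]:
        name, _, value = part.partition("=")
        value = value.strip().strip('"')
        if name.strip().lower() == "charset" and value:
            return value
    return default


# No charset sent: fall back to UTF-8.
print(pick_charset("application/x-www-form-urlencoded"))
# Explicit charset sent: honour it.
print(pick_charset("application/x-www-form-urlencoded; charset=ISO-8859-5"))
```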

Update: If you are using a pre-Unicode version of Delphi (2007 or earlier), Indy uses AnsiString instead of UnicodeString, so TIdURI.URLDecode() will first decode the input to Unicode using the specified AByteEncoding parameter (defaulting to IndyTextEncoding_UTF8 if none is specified), and will then convert the Unicode data to ANSI using the specified ADestEncoding parameter (defaulting to IndyTextEncoding_OSDefault if none is specified).

The Russian input you have shown decodes properly to Unicode when decoded as UTF-8, but it can easily lose characters (turning them into '?') during the conversion to ANSI if your code is running on a machine whose OS-level charset is not a Russian one, such as ISO-8859-5 or KOI8-R.
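
The lossy step can be reproduced in isolation. This Python sketch (again not Indy code) takes the already-decoded Unicode text and encodes it to several 8-bit code pages. Windows-1252 is assumed here as an example of a non-Russian OS default; the actual default depends on the machine.

```python
text = "Все данные по веб-сайту"

# A Western European code page has no Cyrillic letters, so every one of
# them is replaced by '?' (Windows-1252 is an assumed example).
western = text.encode("cp1252", errors="replace")

# Cyrillic-capable 8-bit charsets round-trip the text without loss.
round_trips = {
    charset: text.encode(charset).decode(charset) == text
    for charset in ("iso-8859-5", "koi8-r")
}

print(western)
print(round_trips)
```

Only spaces and the hyphen survive the Windows-1252 conversion, which matches the kind of damage described above.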

To ensure a correct conversion, you would have to specify the desired AnsiString encoding on those machines, e.g.:

var
  LEncoding, LAnsiEncoding: IIdTextEncoding;
...

LEncoding := IndyTextEncoding_UTF8;
LAnsiEncoding := CharsetToEncoding('ISO-8859-5'); // or 'KOI8-R', etc
...
ARequestInfo.Params.Add(TIdURI.URLDecode(s, LEncoding, LAnsiEncoding));

In Unicode versions of Delphi (2009 and later), Indy uses UnicodeString instead of AnsiString, so there is no ADestEncoding parameter present.

Solution 1
5 2017-02-02 02:49:22
