How to decode UTF-8 Unicode characters with Indy
I have a TIdHttpServer application. I have a simple HTML document with special characters:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
<title>This is the title</title>
</head>
<body>
<form method="post">
<p>
<input name="name" value="Все данные по веб-сайту" />
<input type="submit" value="submit" />
</p>
</form>
</body>
</html>
I serve this page and process the POST. My "Get" code is below. The problem is that I am unable to decode the %hh data properly.
procedure TForm3.Get(AContext: TIdContext;
  ARequestInfo: TIdHTTPRequestInfo; AResponseInfo: TIdHTTPResponseInfo);
var
  mFileName: String;
  txtFile: TextFile;
begin
  if ARequestInfo.Params.values['name']<>'' then begin
    AssignFile( txtFile , ChangeFileExt(ParamStr(0),'.log') );
    Append( TxtFile );
    WriteLn(TxtFile,'Unparsed:'+ARequestInfo.UnparsedParams);
    WriteLn(TxtFile,'Parsed:'+ARequestInfo.Params.values['name']);
    MyDecodeAndSetParams(ARequestInfo);
    WriteLn(TxtFile,'Decoded:'+ARequestInfo.Params.values['name']);
    System.Close( TxtFile );
  end ;
  mFileName := ExtractFileDir(ParamStr(0))+'\inputform.txt';
  AResponseInfo.ContentStream := TFileStream.Create(mFileName, fmOpenRead);
end;
The MyDecodeAndSetParams procedure:
procedure MyDecodeAndSetParams(ARequestInfo: TIdHTTPRequestInfo);
var
  i, j : Integer;
  value,s: string;
  LEncoding: IIdTextEncoding;
begin
  if IsHeaderMediaType(ARequestInfo.ContentType, 'application/x-www-form-urlencoded') then
  begin
    value := ARequestInfo.FormParams;
    // LEncoding := CharsetToEncoding(ARequestInfo.CharSet);
    if ARequestInfo.CharSet <> '' then
      LEncoding := CharsetToEncoding(ARequestInfo.CharSet)
    else
      LEncoding := IndyTextEncoding_UTF8;
  end else
  begin
    value := ARequestInfo.QueryParams;
    LEncoding := IndyTextEncoding_UTF8;
  end;
  ARequestInfo.Params.BeginUpdate;
  try
    ARequestInfo.Params.Clear;
    i := 1;
    while i <= Length(value) do
    begin
      j := i;
      while (j <= Length(value)) and (value[j] <> '&') do
      begin
        Inc(j);
      end;
      s := StringReplace(Copy(value, i, j-i), '+', ' ', [rfReplaceAll]);
      ARequestInfo.Params.Add(TIdURI.URLDecode(s, LEncoding));
      i := j + 1;
    end;
  finally
    ARequestInfo.Params.EndUpdate;
  end;
end;
The output in my file is as follows:
Unparsed:name=%D0%92%D1%81%D0%B5+%D0%B4%D0%B0%D0%BD%D0%BD%D1%8B%D0%B5+%D0%BF%D0%BE+%D0%B2%D0%B5%D0%B1-%D1%81%D0%B0%D0%B9%D1%82%D1%83
Parsed:οсе даннϿе по веб-сайϿϿ
Decoded:οсе даннϿе по веб-сайϿϿ
I can take the Unparsed data, decode it using this decoder, and it returns the string properly:
Все данные по веб-сайту
What do I need to do so that I can properly decode the params to what they were on the form?
If ARequestInfo.CharSet is blank (because the client did not send a charset in the HTTP Content-Type header), CharsetToEncoding('') will return Indy's native 8-bit charset rather than UTF-8. That is why your data is not being decoded properly.
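To see the effect outside of Indy, here is a minimal Python sketch (not Delphi, purely an illustration) that percent-decodes the value from your log twice: once as UTF-8 and once as an 8-bit charset. Latin-1 is only my stand-in for "some 8-bit charset"; it is an assumption and not necessarily what Indy's native 8-bit encoding maps to.

```python
from urllib.parse import unquote_plus

# The value of the "name" field, copied from the Unparsed log line.
raw = ("%D0%92%D1%81%D0%B5+%D0%B4%D0%B0%D0%BD%D0%BD%D1%8B%D0%B5+"
       "%D0%BF%D0%BE+%D0%B2%D0%B5%D0%B1-%D1%81%D0%B0%D0%B9%D1%82%D1%83")

# Percent-decode to bytes, then interpret the bytes as UTF-8.
as_utf8 = unquote_plus(raw, encoding="utf-8")

# Same bytes, but each byte is treated as one character of an 8-bit charset
# (Latin-1 here, as an assumed stand-in). Every Cyrillic letter turns into
# two unrelated characters.
as_8bit = unquote_plus(raw, encoding="latin-1")

print(as_utf8)
print(repr(as_8bit))
```

The bytes on the wire are identical in both cases; only the charset used to interpret them differs.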
For application/x-www-form-urlencoded, a charset is not always sent in the HTTP headers, as the client may assume the server knows which charset to expect based on the charset it sent the HTML in. It is also possible that the client might send a charset in the posted form data instead, such as in a _charset_ field.
Try changing this:
LEncoding := CharsetToEncoding(ARequestInfo.CharSet);
To this:
if ARequestInfo.CharSet <> '' then
  LEncoding := CharsetToEncoding(ARequestInfo.CharSet)
else
  LEncoding := IndyTextEncoding_UTF8;
This way, you default to UTF-8 unless the client sends an explicit charset.
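The same fallback order, extended with the _charset_ field mentioned earlier, can be sketched in Python as follows. This is only an illustration of the selection logic, not Indy code: pick_charset and decode_form are hypothetical helper names, and checking _charset_ before falling back to UTF-8 is my assumption about a sensible order.

```python
from email.message import Message
from urllib.parse import parse_qs


def pick_charset(content_type: str, body: str) -> str:
    """Choose a charset: header first, then a _charset_ field, then UTF-8."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # None if the header has no charset
    if charset:
        return charset
    # Field names and the _charset_ value are plain ASCII, so a byte-preserving
    # pass with Latin-1 is enough to look for the field.
    fields = parse_qs(body, encoding="latin-1")
    if "_charset_" in fields:
        return fields["_charset_"][0]
    return "utf-8"


def decode_form(content_type: str, body: str) -> dict:
    """Decode a urlencoded body using the charset chosen above."""
    charset = pick_charset(content_type, body)
    parsed = parse_qs(body, keep_blank_values=True,
                      encoding=charset, errors="replace")
    return {name: values[0] for name, values in parsed.items()}


form = decode_form("application/x-www-form-urlencoded",
                   "name=%D0%92%D1%81%D0%B5")
print(form["name"])
```

With no charset in the header and no _charset_ field, the body is decoded as UTF-8, which matches what the browser sent for your form.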
Update: If you are using a pre-Unicode version of Delphi (2007 or earlier), Indy uses AnsiString instead of UnicodeString, so TIdURI.URLDecode() will first decode the input to Unicode using the specified AByteEncoding parameter (defaulting to IndyTextEncoding_UTF8 if none is specified), and will then convert the Unicode data to ANSI using the specified ADestEncoding parameter (defaulting to IndyTextEncoding_OSDefault if none is specified).
The Russian input you have shown decodes properly to Unicode when decoded as UTF-8, but can easily lose characters (turning them into '?') during the conversion to ANSI if your code is running on a machine that does not use a Russian charset at the OS layer, such as ISO-8859-5 or KOI8-R.
To ensure a correct conversion, you would have to specify the desired AnsiString encoding on those machines, e.g.:
var
  LEncoding, LAnsiEncoding: IIdTextEncoding;
...
LEncoding := IndyTextEncoding_UTF8;
LAnsiEncoding := CharsetToEncoding('ISO-8859-5'); // or 'KOI8-R', etc
...
ARequestInfo.Params.Add(TIdURI.URLDecode(s, LEncoding, LAnsiEncoding));
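The loss during the Unicode-to-ANSI step is easy to reproduce in isolation. The Python sketch below is only an illustration, with cp1252 as an assumed example of a non-Russian OS default codepage; your machine's actual codepage may differ.

```python
text = "Все данные по веб-сайту"

# Converting to a Western codepage that has no Cyrillic letters:
# every letter is replaced by '?', and the original cannot be recovered.
lossy = text.encode("cp1252", errors="replace")
print(lossy)

# Converting to a Cyrillic codepage keeps every letter, so a round trip
# back to Unicode returns the original string.
for codec in ("iso8859_5", "koi8_r"):
    round_trip = text.encode(codec).decode(codec)
    print(codec, round_trip == text)
```

This is why the destination encoding has to be one that actually contains the characters you expect to receive.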
In Unicode versions of Delphi (2009 and later), Indy uses UnicodeString instead of AnsiString, so there is no ADestEncoding parameter present.