HTTP headers encoding/decoding in Java

A custom HTTP header is being passed to a Servlet application for authentication purposes. The header value must be able to contain accents and other non-ASCII characters, so it must be in a known encoding (ideally UTF-8).

I am provided with this piece of Java code by the developers who control the authentication environment:

String firstName = request.getHeader("my-custom-header"); 
String decodedFirstName = new String(firstName.getBytes(),"UTF-8");

But this code doesn't look right to me: it presupposes the encoding of the header value, whereas I thought there was a proper way of specifying an encoding for header values (from MIME, I believe).

Here is my question: what is the right way (tm) of dealing with custom header values that need to support a UTF-8 encoding:

  • on the wire (what the header looks like over the wire)
  • from the decoding point of view (how to decode it using the Java Servlet API, and whether we can assume that request.getHeader() already does the decoding properly)

Here is an environment-independent code sample that treats headers as UTF-8, in case you can't change your service:

String valueAsISO = request.getHeader("my-custom-header");
String valueAsUTF8 = new String(valueAsISO.getBytes("ISO-8859-1"), "UTF-8");
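The same re-decoding step can be tried outside a servlet container. The sketch below is only an illustration: the class and method names are hypothetical, and it assumes the container decoded the raw header bytes as ISO-8859-1 (as described for Tomcat elsewhere in this thread) while the sender actually wrote UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Servlet API.
final class HeaderRedecode {
    // Re-interprets a value that was decoded as ISO-8859-1
    // but whose bytes on the wire were really UTF-8.
    static String latin1ToUtf8(String valueAsIso) {
        if (valueAsIso == null) {
            return null; // getHeader() returns null when the header is absent
        }
        // ISO-8859-1 maps each char 0x00-0xFF back to the identical byte,
        // so this recovers the original wire bytes without loss.
        byte[] wireBytes = valueAsIso.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wireBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate the container: UTF-8 bytes of "José" decoded as ISO-8859-1.
        String asSeenByServlet = new String(
                "Jos\u00e9".getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(latin1ToUtf8(asSeenByServlet).equals("Jos\u00e9"));
    }
}
```

Using the `StandardCharsets` constants instead of charset names also avoids the checked `UnsupportedEncodingException`.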

Again: RFC 2047 is not implemented in practice. The next revision of HTTP/1.1 is going to remove any mention of it.

So, if you need to transport non-ASCII characters, the safest way is to encode them into a sequence of ASCII, such as the "Slug" header in the Atom Publishing Protocol.
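As a rough sketch of that approach, the value can be percent-encoded from its UTF-8 bytes so that only ASCII travels in the header. The class and method names below are hypothetical, and the escaping rule (keep printable ASCII except '%', escape every other octet) is modelled on the Slug header rather than taken from any particular library.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical codec: percent-encodes the UTF-8 octets of a header value.
final class AsciiHeaderCodec {
    static String encode(String value) {
        StringBuilder out = new StringBuilder();
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF;
            // Keep printable ASCII except '%'; escape everything else.
            if (octet >= 0x20 && octet <= 0x7E && octet != '%') {
                out.append((char) octet);
            } else {
                out.append('%').append(String.format("%02X", octet));
            }
        }
        return out.toString();
    }

    // Expects an ASCII-only input, as produced by encode().
    static String decode(String headerValue) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < headerValue.length(); i++) {
            char ch = headerValue.charAt(i);
            if (ch == '%' && i + 2 < headerValue.length()) {
                int hi = Character.digit(headerValue.charAt(i + 1), 16);
                int lo = Character.digit(headerValue.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    bytes.write((hi << 4) | lo);
                    i += 2;
                    continue;
                }
            }
            bytes.write(ch); // literal character (or a malformed escape, kept as-is)
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String wire = encode("Jos\u00e9");
        System.out.println(wire);          // Jos%C3%A9
        System.out.println(decode(wire).equals("Jos\u00e9"));
    }
}
```

Because the wire form is pure ASCII, it survives whatever charset the container assumes when it builds the header string.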

The HTTPbis working group is aware of the issue, and the latest drafts get rid of all the language with respect to TEXT and RFC 2047 encoding, because it is not used in practice over HTTP.

See http://trac.tools.ietf.org/wg/httpbis/trac/ticket/74 for the whole story.

See the HTTP spec for the rules, which says in section 2.2:

The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 [14].

The above code will not correctly decode an RFC 2047 encoded string, which leads me to believe that the service doesn't correctly follow the spec and is just embedding raw UTF-8 data in the header.

As mentioned already, the first look should always go to the HTTP 1.1 spec (RFC 2616). It says that text in header values must use the MIME encoding as defined in RFC 2047 if it contains characters from character sets other than ISO-8859-1.

So here's a plus for you. If your requirements are covered by the ISO-8859-1 charset, then you just put your characters into your request/response messages. Otherwise MIME encoding is the only alternative.

As long as the user agent sends the values of your custom headers according to these rules, you won't have to worry about decoding them. That's what the Servlet API should do.


However, there's a more basic reason why your code snippet doesn't do what it's supposed to. The first line fetches the header value as a Java string. A Java string already holds decoded characters (UTF-16 internally), so at this point the HTTP request message parsing is already done and finished.

The next line fetches the byte array of this string. Since no encoding was specified (IMHO this method with no argument should have been deprecated long ago), the current system default encoding is used, which is often not UTF-8, and then the array is converted back to a string as if it were UTF-8 encoded. Ouch.
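To see why the result depends on the platform, the default charset can be made an explicit parameter. This is a hypothetical illustration (class and method names are made up) that assumes the container handed back UTF-8 wire bytes decoded as ISO-8859-1; it then replays `new String(s.getBytes(), "UTF-8")` with different stand-ins for the default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of the no-argument getBytes() pitfall.
final class DefaultCharsetPitfall {
    // Same shape as new String(s.getBytes(), "UTF-8"),
    // with the platform default charset passed in explicitly.
    static String redecode(String s, Charset assumedDefault) {
        return new String(s.getBytes(assumedDefault), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 bytes of "é" (C3 A9) decoded as ISO-8859-1 give the two chars "Ã©".
        String fromContainer = new String(
                "\u00e9".getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);

        // Default happens to be ISO-8859-1: the original character is recovered.
        System.out.println(redecode(fromContainer, StandardCharsets.ISO_8859_1).equals("\u00e9"));
        // Default is UTF-8: encode and decode cancel out, the mojibake stays.
        System.out.println(redecode(fromContainer, StandardCharsets.UTF_8).equals(fromContainer));
        // Default is US-ASCII: unmappable chars are replaced, the data is lost.
        System.out.println(redecode(fromContainer, StandardCharsets.US_ASCII).equals("??"));
    }
}
```

All three lines print `true`, which is the problem: the same code yields three different strings depending on a JVM setting the application does not control.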

Thanks for the answers. It seems that the ideal would be to follow the proper HTTP header encoding as per RFC 2047. Header values in UTF-8 on the wire would look something like this:

=?UTF-8?Q?...?=
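For illustration only, here is a minimal sketch of decoding such encoded-words by hand. The class name is hypothetical and the code covers just the basic `=?charset?Q?text?=` and `=?charset?B?text?=` forms; it does not merge adjacent encoded-words, handle language tags, or tolerate malformed input, and unknown charset names make `Charset.forName` throw.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified RFC 2047 encoded-word decoder.
final class Rfc2047Sketch {
    // =?charset?encoding?encoded-text?=
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?]+)\\?([bBqQ])\\?([^?]*)\\?=");

    static String decode(String headerValue) {
        Matcher m = ENCODED_WORD.matcher(headerValue);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Charset charset = Charset.forName(m.group(1));
            byte[] raw = m.group(2).equalsIgnoreCase("B")
                    ? Base64.getDecoder().decode(m.group(3))
                    : decodeQ(m.group(3));
            m.appendReplacement(out, Matcher.quoteReplacement(new String(raw, charset)));
        }
        m.appendTail(out); // text outside encoded-words passes through unchanged
        return out.toString();
    }

    // "Q" encoding: '_' means space, "=XX" is a hex-escaped octet.
    private static byte[] decodeQ(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '_') {
                bytes.write(' ');
            } else if (c == '=' && i + 2 < text.length()) {
                bytes.write(Integer.parseInt(text.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                bytes.write(c);
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(decode("=?UTF-8?Q?Jos=C3=A9?=").equals("Jos\u00e9"));
        System.out.println(decode("=?UTF-8?B?Sm9zw6k=?=").equals("Jos\u00e9"));
    }
}
```

Note that this only helps if the sender really produces encoded-words; other answers in this thread point out that RFC 2047 is rarely used over HTTP in practice.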

Now here is the funny thing: it seems that neither Tomcat 5.5 nor 6 properly decodes HTTP headers as per RFC 2047! The Tomcat code assumes everywhere that header values use ISO-8859-1.

So for Tomcat, specifically, I will work around this by writing a filter which handles the proper decoding of the header values.
