Does Nginx support raw Unicode in paths?

Browsers percent-encode Unicode characters in URLs (as %XX) by default.

However, I can make a request via curl to http://localhost:8080/与 and nginx sees the path as "与". How is this possible? Does nginx allow arbitrary Unicode in its path then?

For example, with this config I can set an additional header to see what nginx saw:

location ~* "(*UTF8)([^\w/\.\-\\% ])" {
        add_header "response" $1;
        return 200;
}

Request:

* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /与 HTTP/1.1
> User-Agent: curl/7.30.0
> Host: localhost:8080
> Accept: */*
> 
< HTTP/1.1 200 OK
* Server nginx/1.4.6 (Ubuntu) is not blacklisted
< Server: nginx/1.4.6 (Ubuntu)
< Date: Tue, 20 Jan 2015 21:44:51 GMT
< Content-Type: application/octet-stream
< Content-Length: 0
< Connection: keep-alive
< response: 与                                        <--- SEE THIS?
< 
* Connection #0 to host localhost left intact

However, when I remove the UTF8 marker, the header contains "?" as if nginx can't understand the character (or is only reading the first byte).

location ~* "([^\w/\.\-\\% ])" {
        add_header "response" $1;
        return 200;
}

Request:

* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /与 HTTP/1.1
> User-Agent: curl/7.30.0
> Host: localhost:8080
> Accept: */*
> 
< HTTP/1.1 200 OK
* Server nginx/1.4.6 (Ubuntu) is not blacklisted
< Server: nginx/1.4.6 (Ubuntu)
< Date: Tue, 20 Jan 2015 21:45:35 GMT
< Content-Type: application/octet-stream
< Content-Length: 0
< Connection: keep-alive
< response: ?
< 
* Connection #0 to host localhost left intact

Note: Changing this non-UTF-8 regex to capture one or more characters, ([^...]+), also results in the response: 与 header being sent (byte vs. multibyte strings?).

Logging either regex match to a file results in a request entry like:

GET /\xE4\xB8\x8E HTTP/1.1

Apart from the regexes and terminal configuration, this doesn't have anything to do with Unicode. The short answer to your question is: nginx doesn't care about Unicode encodings, but it does accept non-ASCII bytes in URLs.

Here's the long answer that explains what you're seeing. If you enter the command

curl http://localhost:8080/与

and your terminal uses UTF-8 as its encoding, the terminal will encode the character 与 (U+4E0E) as the three-byte UTF-8 sequence

0xE4 0xB8 0x8E
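As a quick way to check the byte sequence above, here is a minimal Python sketch. It only illustrates the encoding step; the variable names are made up for this example, and the percent-encoded form is shown for comparison with what a browser would send.

```python
from urllib.parse import quote

char = "与"  # U+4E0E

# What a UTF-8 terminal hands to curl: three raw bytes.
raw = char.encode("utf-8")
print(raw)  # b'\xe4\xb8\x8e'

# What a browser would put on the wire instead: the same bytes, percent-encoded.
print(quote("/" + char))  # /%E4%B8%8E

# The request line with the raw bytes left unescaped.
request_line = b"GET /" + raw + b" HTTP/1.1"
```

Either way, the server ends up looking at the same three bytes; the only difference is whether they arrive escaped.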

curl apparently accepts non-ASCII bytes in URLs, although they're technically illegal. It will then send an HTTP request with these non-ASCII bytes. Since there is no default way to display these bytes, I'll use C-style hex escapes like \x00 from now on to represent them. So the request line sent by curl looks like:

GET /\xE4\xB8\x8E HTTP/1.1

That's three bytes after the first /. If the terminal on which you view your logs also supports UTF-8, this will be displayed on your screen as

GET /与 HTTP/1.1

But this does not mean that there are Unicode characters in your HTTP request. On the HTTP level, we only deal with bytes.

nginx also seems to happily accept non-ASCII bytes in URLs. Then the following regex

(*UTF8)([^\w/\.\-\\% ])

working in UTF-8 mode treats the byte sequence \xE4\xB8\x8E as the single character 与. That character is not matched by \w (without Unicode property support, \w covers only ASCII word characters) or by anything else listed in the negated class, so the class matches it and the header will be

response: \xE4\xB8\x8E

which your terminal displays as

response: 与

On the other hand, the regex

([^\w/\.\-\\% ])

works directly on bytes, so it matches at most a single byte. The first byte of the sequence \xE4\xB8\x8E is not matched by \w (with PCRE's default character tables, \w covers only ASCII letters, digits, and the underscore), so the negated class matches that one byte and the header will be:

response: \xE4

which your terminal decides to display as

response: ?

because the byte \xE4 followed by a newline is invalid UTF-8. The regex ([^\w/\.\-\\% ]+) captures the whole byte sequence, so it produces the same result as the UTF-8 regex.
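The difference between the byte-mode and character-mode matches can be reproduced outside nginx. The sketch below uses Python's re module as a stand-in, which is an assumption: it is not PCRE, and the re.ASCII flag is used here only to keep \w ASCII-only so the str match roughly mirrors the (*UTF8) pattern. All variable names are illustrative.

```python
import re

path_str = "/与"
path_bytes = path_str.encode("utf-8")  # b'/\xe4\xb8\x8e'

# Byte mode, single item: the class consumes exactly one byte.
one_byte = re.search(rb"([^\w/\.\-\\% ])", path_bytes).group(1)
print(one_byte)  # b'\xe4'

# Byte mode, one-or-more: all three bytes are captured.
all_bytes = re.search(rb"([^\w/\.\-\\% ]+)", path_bytes).group(1)
print(all_bytes)  # b'\xe4\xb8\x8e'

# Character mode: the three bytes count as one character, captured whole.
one_char = re.search(r"([^\w/\.\-\\% ])", path_str, re.ASCII).group(1)
print(one_char)  # 与

# A lone \xE4 is not valid UTF-8, so a decoder substitutes a
# replacement character, much like the terminal showing "?".
shown = (one_byte + b"\n").decode("utf-8", errors="replace")
print(repr(shown))  # '\ufffd\n'
```

The one-or-more byte match and the character-mode match yield the same three bytes, which is why both headers render as 与.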

If you see something like

GET /\xE4\xB8\x8E HTTP/1.1

in your logs, that's because the authors of the logging code decided to use escape sequences for non-ASCII bytes. In general, this is a good idea because it always produces the same output regardless of terminal configuration and really shows what's going on: your HTTP request simply contains non-ASCII bytes.
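To illustrate the idea of escaping bytes for a log line, here is a rough Python sketch. It is not nginx's actual logging code; the function name escape_for_log and the exact set of escaped bytes (anything outside printable ASCII, plus the double quote and backslash) are assumptions made for this example.

```python
def escape_for_log(data: bytes) -> str:
    """Render bytes as ASCII text, using \\xHH for anything not plainly printable."""
    parts = []
    for byte in data:
        # Keep printable ASCII, except the double quote (0x22) and backslash (0x5C).
        if 32 <= byte <= 126 and byte not in (0x22, 0x5C):
            parts.append(chr(byte))
        else:
            parts.append(f"\\x{byte:02X}")
    return "".join(parts)


print(escape_for_log(b"GET /\xe4\xb8\x8e HTTP/1.1"))  # GET /\xE4\xB8\x8E HTTP/1.1
```

Because the output is pure ASCII, it looks identical on every terminal, whatever encoding that terminal assumes.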

Doesn't your own testing already seem to answer your question?

Yes, nginx does support Unicode in paths.

As a point of discussion, nginx normalises URLs prior to location matching, as pointed out in the documentation at http://nginx.org/r/location. This is why different "weird" requests (such as those containing ../, or those encoding ? as %3F, which makes it part of the filename instead of marking the start of the parameters known as $args) may still end up being served by a single location that does not look like a one-to-one match to the naked eye.

This normalisation may also explain why the "same" string appears differently within access_log (pre-normalised) vs. error_log (normalised).
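The kind of normalisation described above can be approximated in a few lines of Python. This is only a sketch of the general idea (split off the query string, decode %XX, resolve . and .., collapse repeated slashes), not a reimplementation of nginx; the function name normalize_path is hypothetical and edge cases will differ from the real server.

```python
from urllib.parse import unquote_to_bytes


def normalize_path(raw_target: str) -> bytes:
    """Approximate the path that location matching would see."""
    # A literal "?" starts the query string; an encoded %3F does not.
    path = raw_target.split("?", 1)[0]
    decoded = unquote_to_bytes(path)

    segments = []
    for segment in decoded.split(b"/"):
        if segment == b"..":
            if segments:
                segments.pop()  # step back one directory
        elif segment in (b"", b"."):
            continue  # drop empty segments (repeated slashes) and "."
        else:
            segments.append(segment)

    result = b"/" + b"/".join(segments)
    if segments and decoded.endswith(b"/"):
        result += b"/"  # keep a trailing slash
    return result


print(normalize_path("/a/../%E4%B8%8E"))  # b'/\xe4\xb8\x8e'
print(normalize_path("/x%3Fy?z=1"))       # b'/x?y'
```

Both the percent-encoded request and a raw-byte request for /与 reduce to the same three bytes here, which matches the observation that one location can serve requests that look different on the wire.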
