Regex and encoding attacks: how does internal encoding work in PHP?

Question

I am using a UTF-8 regex to extract the parts of the Content-Type: header line, since I am in the habit of configuring my servers to use UTF-8 consistently.

// example type, actually this will be negotiated from request `Accept:` header line.
$content_type = 'TeXt/HtMl';
preg_match('~^([\w-]+\*?)/([\w-]+\*?)$~ui', $content_type, $matches);

I am considering loading classes from a filesystem path built from the subpattern matches.

Is there any conceivable way to inject a '/../' through an encoding attack? How does internal encoding work in general? Do I have to care which charset the request is encoded in when processing data in PHP code, or does the conversion happen automatically and reliably? What else should be kept in mind regarding encoding security? How can one ensure a given encoding in deployed code running on unknown systems?

EDIT: As requested in the comments, the subsequent code could look like this, for example:

$m1 = strtolower($matches[1]);
$m2 = strtolower($matches[2]);
include_once "/path/to/project/content_handlers/{$m1}_{$m2}";

Remarks: My question was meant to be more general. Consider this scenario: the PHP script is encoded in UTF-8, the server's filesystem uses character set A, and the client manipulates the request so that it is sent in encoding B. Is there a risk that the Accept header is written in such a way that the preg_* functions do not recognize a '/../' (parent directory) sequence but the filesystem does? The question is not limited to the particular regex in the example. Could an attacker include arbitrary files present in the filesystem if no further precautions are taken?

Remarks 2: In the example I cannot rely on http_negotiate_content_type, since pecl_http may not be installed on the target server. There is a scripted polyfill as well. Again: this is not a question about one particular case. I want to learn how to treat client encodings (even manipulated ones) in general.

Remarks 3: A similar problem (with SQL encoding attacks) is discussed here: Are PDO prepared statements sufficient to prevent SQL injection? However, my question is about filesystem encoding. Could something similar happen there?

Answer 1

I'll be bold and say that your code will effectively block malicious substrings. If someone tries to sneak in a sequence of unwanted characters, it will be smacked down by preg_match(). Your use of anchors and character classes gives no wiggle room. The pattern is nice and strict.
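To make that claim checkable, here is a minimal sketch that runs a few hostile-looking values through the pattern. The helper name parse_content_type() and the probe strings are my own illustrations, not part of the original code. Two details are worth noticing: with the u modifier, preg_match() returns false (not 0) when the subject is not valid UTF-8, so the check compares against 1; and $ without the D modifier also matches before a trailing newline, although the newline never ends up in a capture group.

```php
<?php
// Hypothetical helper around the pattern from the question; the name is illustrative.
function parse_content_type(string $value): ?array
{
    // preg_match() returns 1 on a match, 0 on no match, and false on error,
    // e.g. when the subject is not valid UTF-8 and the u modifier is set.
    if (preg_match('~^([\w-]+\*?)/([\w-]+\*?)$~u', $value, $m) !== 1) {
        return null;
    }
    return [$m[1], $m[2]];
}

$probes = [
    'TeXt/HtMl',          // ordinary value: accepted
    '../../etc/passwd',   // plain traversal: '.' is not in [\w-], rejected
    'text/../secret',     // traversal after the slash: rejected
    '%2e%2e/%2e%2e',      // percent-encoding is never decoded here: rejected
    "text/\xC0\xAEhtml",  // overlong encoding of '.', invalid UTF-8: rejected
    "text/html\0.php",    // NUL byte: rejected
    "text/html\n",        // trailing newline: accepted, captures stay clean
];

foreach ($probes as $probe) {
    // addcslashes() makes control and high bytes visible in the output.
    echo addcslashes($probe, "\0..\37\177..\377"), ' => ',
        var_export(parse_content_type($probe), true), PHP_EOL;
}
```

PHP strings are plain byte sequences, so nothing is converted behind your back: the pattern sees exactly the bytes the client sent, and only the captured pieces should ever reach the filesystem path.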

Just a couple of notes:

\w already matches both uppercase and lowercase letters, so the i pattern modifier is not necessary.
Your capture groups are stored in $matches[1] and $matches[2]. The full-string match is in $matches[0].
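A quick sketch of the first note, plus one related caveat: with the u modifier, PHP's \w also matches non-ASCII letters, so the captures are not guaranteed to be ASCII. If only ASCII names are wanted in the path, an explicit class such as [A-Za-z0-9_-] is one option (that is my suggestion, not something the original pattern requires). The variable names are illustrative.

```php
<?php
// The i modifier changes nothing here: \w covers both cases already.
$with_i    = preg_match('~^([\w-]+)/([\w-]+)$~ui', 'TeXt/HtMl');
$without_i = preg_match('~^([\w-]+)/([\w-]+)$~u', 'TeXt/HtMl');

// Under the u modifier, \w also accepts non-ASCII letters such as U+00EB.
$unicode_word = preg_match('~^[\w-]+$~u', "t\u{00EB}xt");

// An explicit ASCII class rejects the same input.
$ascii_only = preg_match('~^[A-Za-z0-9_-]+$~', "t\u{00EB}xt");

var_dump($with_i, $without_i, $unicode_word, $ascii_only);
```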

Code:

$content_type = 'TeXt/HtMl';
if (!preg_match('~^([\w-]+\*?)/([\w-]+\*?)$~u', $content_type, $matches)) {
    echo "invalid content type";
} else {
    var_export($matches);
}

Output:

array (
  0 => 'TeXt/HtMl',
  1 => 'TeXt',
  2 => 'HtMl',
)
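If you want a further precaution on top of the regex, one option is to never build the include path from client input at all and instead look the validated type up in a fixed map. The following is only a sketch under assumptions: the constant CONTENT_HANDLERS, the function resolve_handler(), and the handler file names are hypothetical and would need to match your project.

```php
<?php
// Hypothetical allowlist: negotiated content type => handler file name.
const CONTENT_HANDLERS = [
    'text/html'        => 'text_html.php',
    'application/json' => 'application_json.php',
];

// Returns the handler path for a known content type, or null otherwise.
function resolve_handler(string $content_type, string $base_dir): ?string
{
    // Same pattern as above; false (invalid UTF-8) and 0 (no match) both fail.
    if (preg_match('~^([\w-]+\*?)/([\w-]+\*?)$~u', $content_type, $m) !== 1) {
        return null;
    }
    $key = strtolower($m[1]) . '/' . strtolower($m[2]);
    if (!array_key_exists($key, CONTENT_HANDLERS)) {
        return null; // syntactically valid, but no handler is registered
    }
    // Only file names from the fixed map ever reach the filesystem.
    return rtrim($base_dir, '/') . '/' . CONTENT_HANDLERS[$key];
}

var_dump(resolve_handler('TeXt/HtMl', '/path/to/project/content_handlers'));
var_dump(resolve_handler('image/png', '/path/to/project/content_handlers'));
```

With this design the filesystem's character set stops mattering for security, because the path components are literals from your own source code rather than bytes supplied by the client.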

Answer 1
1 2018-09-15 06:07:58
