
Escape Elasticsearch special characters in PHP

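I want to create a function that escapes Elasticsearch special characters in PHP by adding a \ before each of them. The special characters that Elasticsearch uses are: + - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /

I am not very familiar with regex, but I found a piece of code that simply removes the special characters. I would prefer to escape them instead, because they may be relevant. The code I use: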

$s_input = 'The next chars should be escaped: + - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ / Did it work?';
$search_query = preg_replace('/(\+|\-|\=|\&|\||\!|\(|\)|\{|\}|\[|\]|\^|\"|\~|\*|\<|\>|\?|\:|\\\\)/', '', $s_input);

This outputs:

The next chars should be escaped / Did it work

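So there are two problems: this code removes the special characters, while I want to escape them with a \. Furthermore, this code does not escape the \. Does anyone know how to escape the Elasticsearch special characters?

You can use preg_replace with a backreference, as stribizhev has already pointed out (the simplest way):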

$string = "The next chars should be escaped: + - = && || > < ! ( ) { } [ ] ^ \" ~ * ? : \ / Did it work?"; 

function escapeElasticReservedChars($string) {
    $regex = "/[\\+\\-\\=\\&\\|\\!\\(\\)\\{\\}\\[\\]\\^\\\"\\~\\*\\<\\>\\?\\:\\\\\\/]/";
    return preg_replace($regex, addslashes('\\$0'), $string);
}
echo escapeElasticReservedChars($string);
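
The addslashes() call above is there because of how preg_replace reads its replacement string: a backslash has to be escaped itself, and a backslash directly before $ turns the reference into literal text. A minimal sketch of the difference, using a made-up subject string a+b that is not part of the original answer:

```php
<?php
$subject = 'a+b';

// '\\\\$0' is the 4-character string \\$0: an escaped backslash, then the match.
$escaped = preg_replace('/\+/', '\\\\$0', $subject);

// '\\$0' is the 3-character string \$0: the backslash escapes the dollar sign,
// so the text "$0" is inserted literally instead of the match.
$literal = preg_replace('/\+/', '\\$0', $subject);

// addslashes('\\$0') doubles the backslash and produces the 4-character form.
$viaAddslashes = preg_replace('/\+/', addslashes('\\$0'), $subject);

echo $escaped, PHP_EOL;       // a\+b
echo $literal, PHP_EOL;       // a$0b
echo $viaAddslashes, PHP_EOL; // a\+b
```

Writing the replacement as '\\\\$0' directly avoids the extra function call and should behave the same way.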

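Or use the preg_replace_callback function. Thanks to the callback, you have access to the current match and can edit it.

A callback that will be called and passed an array of matched elements in the subject string. The callback should return the replacement string. This is the callback signature:

handler(array $matches): string

Here it is in action: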

<?php 
$string = "The next chars should be escaped: + - = && || > < ! ( ) { } [ ] ^ \" ~ * ? : \ / Did it work?"; 

function escapeElasticSearchReservedChars($string) {
    $regex = "/[\\+\\-\\=\\&\\|\\!\\(\\)\\{\\}\\[\\]\\^\\\"\\~\\*\\<\\>\\?\\:\\\\\\/]/";
    $string = preg_replace_callback ($regex, 
        function ($matches) { 
            return "\\" . $matches[0]; 
        }, $string); 
    return $string;
}
echo escapeElasticSearchReservedChars($string);

Output:

The next chars should be escaped\: \+ \- \= \&\& \|\| \> \< \! \( \) \{ \} \[ \] \^ \" \~ \* \? \: \\ \/ Did it work\?

If anyone is looking for a slightly more verbose (but readable!) solution:

public function escapeElasticsearchValue($searchValue)
{
    $searchValue = str_replace('\\', '\\\\', $searchValue);
    $searchValue = str_replace('*', '\\*', $searchValue);
    $searchValue = str_replace('?', '\\?', $searchValue);
    $searchValue = str_replace('+', '\\+', $searchValue);
    $searchValue = str_replace('-', '\\-', $searchValue);
    $searchValue = str_replace('&&', '\\&&', $searchValue);
    $searchValue = str_replace('||', '\\||', $searchValue);
    $searchValue = str_replace('!', '\\!', $searchValue);
    $searchValue = str_replace('(', '\\(', $searchValue);
    $searchValue = str_replace(')', '\\)', $searchValue);
    $searchValue = str_replace('{', '\\{', $searchValue);
    $searchValue = str_replace('}', '\\}', $searchValue);
    $searchValue = str_replace('[', '\\[', $searchValue);
    $searchValue = str_replace(']', '\\]', $searchValue);
    $searchValue = str_replace('^', '\\^', $searchValue);
    $searchValue = str_replace('~', '\\~', $searchValue);
    $searchValue = str_replace(':', '\\:', $searchValue);
    $searchValue = str_replace('"', '\\"', $searchValue);
    $searchValue = str_replace('=', '\\=', $searchValue);
    $searchValue = str_replace('/', '\\/', $searchValue);

    // < and > can’t be escaped at all. The only way to prevent them from
    // attempting to create a range query is to remove them from the query
    // string entirely
    $searchValue = str_replace('<', '', $searchValue);
    $searchValue = str_replace('>', '', $searchValue);

    return $searchValue;
}

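It seems that none of the given answers actually follows the documentation, so here is another answer that correctly encodes any untrusted input: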

/**
 * @param string $s untrusted user input
 * @return string safe string to be used in `query_string` argument to elasticsearch
 */
function escapeForElasticSearch($s)
{
    static $keys = array();
    static $values = array();
    if (!$keys)
    {
        # https://www.elastic.co/guide/en/elasticsearch/reference/5.5/query-dsl-query-string-query.html#_reserved_characters
        $replacements = array(
            "\\" => "\\\\", # must be done first to not double encode later backslashes!
            "+" => "\\+",
            "-" => "\\-",
            "=" => "\\=",
            "&" => "\\&",
            "|" => "\\|",
            ">" => "", # cannot be safely encoded
            "<" => "", # cannot be safely encoded
            "!" => "\\!",
            "(" => "\\(",
            ")" => "\\)",
            "{" => "\\{",
            "}" => "\\}",
            "[" => "\\[",
            "]" => "\\]",
            "^" => "\\^",
            "\"" => "\\\"",
            "~" => "\\~",
            "*" => "\\*",
            "?" => "\\?",
            ":" => "\\:",
            "/" => "\\/",
        );
        $keys = array_keys($replacements);
        $values = array_values($replacements);
    }
    return str_replace($keys, $values, $s);
}

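Note that & and | are not special on their own, but correctly handling odd numbers of these characters is harder than simply encoding every instance of them.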
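
The "must be done first" comment on the backslash entry matters because str_replace with array arguments applies the pairs one after another, each on the result of the previous one. A minimal sketch with a made-up input (a+b\c) and only two of the replacement pairs:

```php
<?php
$input = 'a+b\\c'; // the 5-character string a+b\c

// Backslash first: the backslash added for "+" is not touched again.
$good = str_replace(['\\', '+'], ['\\\\', '\\+'], $input);

// Backslash last: the backslash added for "+" gets doubled as well.
$bad = str_replace(['+', '\\'], ['\\+', '\\\\'], $input);

echo $good, PHP_EOL; // a\+b\\c
echo $bad, PHP_EOL;  // a\\+b\\c
```

In the second result the + ends up preceded by an escaped backslash, so the + itself is no longer escaped.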

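Full disclosure: I have never used Elasticsearch, and my suggestion does not come from personal experience, nor has it been tested against Elasticsearch. I produced it from my knowledge of regex and string manipulation. If anyone finds a flaw, I would be glad to receive your comment.

My snippet:

  • first removes all occurrences of < and > from the string, then
  • looks for single occurrences of the characters in the reserved list, or for an ampersand or pipe that is immediately followed by the same character. Every qualifying character is escaped with a backslash.

Code: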

$string = "To be escaped: + - = && || > < ! ( ) { } [ ] ^ \" ~ * ? : \ / triple ||| and split '&<&'"; 

echo escapeElasticSearchReservedChars($string);

function escapeElasticSearchReservedChars(string $string): string
{
    return preg_replace(
        [
            '_[<>]+_',
            '_[-+=!(){}[\]^"~*?:\\/\\\\]|&(?=&)|\|(?=\|)_',
        ],
        [
            '',
            '\\\\$0',
        ],
        $string
    );
}

Output:

To be escaped\: \+ \- \= \&& \||   \! \( \) \{ \} \[ \] \^ \" \~ \* \? \: \\ \/ triple \|\|| and split '\&&'

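The reason < and > are removed first is so that nobody can try to defeat the design of the replacement by passing in |>|, which would otherwise prevent the two consecutive pipes (left after the > is removed) from being escaped properly.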
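
The effect of the ordering can be seen by running the same two patterns in both orders. This is only a sketch: the input a |>| b and the variable names are made up for illustration.

```php
<?php
// The same two patterns as in the function above.
$stripAngles    = '_[<>]+_';
$escapeReserved = '_[-+=!(){}[\]^"~*?:\\/\\\\]|&(?=&)|\|(?=\|)_';

$input = 'a |>| b';

// Intended order: strip < and > first, then escape.
$safe = preg_replace([$stripAngles, $escapeReserved], ['', '\\\\$0'], $input);

// Swapped order: escape first, then strip.
$unsafe = preg_replace([$escapeReserved, $stripAngles], ['\\\\$0', ''], $input);

echo $safe, PHP_EOL;   // a \|| b
echo $unsafe, PHP_EOL; // a || b
```

With the swapped order neither pipe is followed by another pipe when the escaping runs, so the || that appears after stripping stays unescaped.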

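The simple way is to match with a single character class.
The only question is what to use as the delimiter (for readability).

Using @ as the regex delimiter, it is:

Find: '@[-+=&|><!(){}[\]^"~*?:\\\\/]@'
Replace: '\\\\$0'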
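
As a quick check, the pair can be passed straight to preg_replace. A minimal sketch, where the sample query price:(10+5)/2 is invented for illustration and both literals are written as PHP single-quoted strings:

```php
<?php
// Single character class with @ as the delimiter; \\\\ in the PHP literal
// is the regex \\, i.e. one literal backslash inside the class.
$find    = '@[-+=&|><!(){}[\]^"~*?:\\\\/]@';
// Replacement: a literal backslash followed by the matched character.
$replace = '\\\\$0';

$result = preg_replace($find, $replace, 'price:(10+5)/2');

echo $result, PHP_EOL; // price\:\(10\+5\)\/2
```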


But what if the actual character is already escaped?
What then?

One solution is to find only those that are not escaped.

Find: '@(?<!\\\\)(?:\\\\\\\\)*\K(?:[-+=&|><!(){}[\]^"~*?:/]|\\\\(?!\\\\))@'
Replace: '\\\\$0'

Formatted:

 (?<! \\ )                     # Not an escape behind 
 (?: \\ \\ )*                  # Possible even number of escapeds
 \K                            # Don't include the previous escapes in match
 (?:
      [-+=&|><!(){}[\]^"~*?:/]      # Either 1 of these special characters
   |                              # or,
      \\                            # An escape character that is
      (?! \\ )                      # not followed by escape itself.
 )
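
The formatted pattern can also be used more or less as written if PCRE's x (extended) flag is set, since that flag ignores whitespace and # comments outside character classes. A minimal sketch, assuming a nowdoc so that no PHP-level backslash escaping is needed; the function name escapeUnescapedReservedChars is hypothetical:

```php
<?php
function escapeUnescapedReservedChars(string $s): string
{
    // Nowdoc: backslashes reach the regex engine exactly as typed.
    $pattern = <<<'REGEX'
@
 (?<! \\ )                       # Not an escape behind
 (?: \\ \\ )*                    # Possible even number of escapes
 \K                              # Don't include the previous escapes in match
 (?:
      [-+=&|><!(){}[\]^"~*?:/]   # Either 1 of these special characters
   |                             # or,
      \\ (?! \\ )                # an escape character not followed by escape
 )
@x
REGEX;

    // Fall back to the input if preg_replace reports an error (null).
    return preg_replace($pattern, '\\\\$0', $s) ?? $s;
}

echo escapeUnescapedReservedChars('a+b'), PHP_EOL; // a\+b
```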

