
Regex to match string with and without special/accented characters?

Is there a regular expression to match a specific string with and without special characters? Special-character-insensitive, so to speak.

For example, céra should match cera, and vice versa.

Any ideas?

Edit: I want to match specific strings with and without special/accented characters, not just any string or character.

Test example:

$clientName   = 'céra';
$this->search = 'cera';

$compareClientName = strtolower(iconv('utf-8', 'ascii//TRANSLIT', $clientName));
$this->search      = strtolower($this->search);

if (strpos($compareClientName, $this->search) !== false)
{
    $clientName = preg_replace('/(.*?)('.$this->search.')(.*?)/iu', '$1<span class="highlight">$2</span>$3', $clientName);
}

Output: <span class="highlight">céra</span>

As you can see, I want to highlight the specific search string. However, I still want to display the original (accented) characters of the matched string.

I'll have to combine this with Michael Sivolobov's answer somehow, I guess.

I think I'll have to work with a separate preg_match() and preg_replace(), right?

You can use the \p{L} pattern to match any letter.

Source

You have to add the u modifier after the regular expression to enable Unicode mode.

Example: /\p{L}+/u
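A minimal sketch of that pattern in use. The sample text is made up for illustration, and the accented letters are written as \u{...} escapes so the result does not depend on how the source file is encoded:

```php
<?php
// Hypothetical sample text: "Céra, Zoë & Ño" with the accented letters escaped.
$text = "C\u{00E9}ra, Zo\u{00EB} & \u{00D1}o";

// With the u modifier the subject is decoded as UTF-8,
// so \p{L} sees whole letters rather than single bytes.
preg_match_all('/\p{L}+/u', $text, $matches);

print_r($matches[0]); // one entry per run of letters, accents included
```

Without the u modifier the same pattern would work on bytes, so multi-byte letters would not be recognised as letters.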

Edit:

Try something like this. It replaces every accented letter in the name with an alternation that contains the accented letter (as a single precomposed character), the unaccented letter, and the unaccented letter followed by a combining mark (the two-character Unicode form). You can then use the resulting search pattern to highlight your text.

function mbStringToArray($string)
{
    $array = []; // initialise so that an empty string returns an empty array
    $strlen = mb_strlen($string, "UTF-8");
    while($strlen)
    {
        $array[] = mb_substr($string, 0, 1, "UTF-8");
        $string = mb_substr($string, 1, $strlen, "UTF-8");
        $strlen = mb_strlen($string, "UTF-8");
    }
    return $array;
}

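// I had to use this ugly function to remove accents as iconv didn't work properly on my test server.
// It maps each accented letter to its ASCII counterpart, character by character
// (utf8_encode()/utf8_decode() are deprecated as of PHP 8.2, so they are avoided here).
function stripAccents($stripAccents){
    $from = preg_split('//u', 'àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ', -1, PREG_SPLIT_NO_EMPTY);
    $to   = str_split('aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY');
    return strtr($stripAccents, array_combine($from, $to));
}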

$clientName = 'céra';

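// Split the unaccented copy by character as well, so its indexes line up with
// $clientNameArray even if stripAccents() leaves some multi-byte characters untouched.
$clientNameNoAccentArray = mbStringToArray(stripAccents($clientName));

$clientNameArray = mbStringToArray($clientName);

foreach($clientNameArray as $pos => &$char)
{
    $charNA = $clientNameNoAccentArray[$pos];
    if($char != $charNA)
    {
        $char = "(?:$char|$charNA|$charNA\p{M})";
    }
    else
    {
        $char = preg_quote($char, '/'); // escape regex metacharacters in the name
    }
}
unset($char); // break the reference left over from the foreach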

$clientSearchPattern = implode($clientNameArray); // c(?:é|e|e\p{M})ra

$text = 'the client name is Céra but it could be Cera or céra too.';

$search = preg_replace('/(.*?)(' . $clientSearchPattern . ')(.*?)/iu', '$1<span class="highlight">$2</span>$3', $text);

echo $search; // the client name is <span class="highlight">Céra</span> but it could be <span class="highlight">Cera</span> or <span class="highlight">céra</span> too.

If you want to know whether a letter is followed by an accent or another combining mark, you can check for it with the pattern \p{M}.
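As a small illustration (the variable names are only for this sketch): \p{M} matches a combining mark that is a separate code point, so it finds the accent in the decomposed form of é but not in the precomposed single character.

```php
<?php
$precomposed = "\u{00E9}";  // é as one code point (U+00E9)
$decomposed  = "e\u{0301}"; // e followed by U+0301 COMBINING ACUTE ACCENT

var_dump(preg_match('/\p{M}/u', $precomposed)); // int(0): no separate mark
var_dump(preg_match('/\p{M}/u', $decomposed));  // int(1): the combining mark is found
```

This is why a pattern that should cover both forms needs the precomposed letter as an explicit alternative.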

UPDATE

You need to convert every accented letter in your pattern into a group of alternatives:

E.g. céra -> c(?:é|e|e\p{M})ra

Why did I add e\p{M}? Because your letter é can be a single character in Unicode, or a combination of two characters (e followed by a combining acute accent). e\p{M} matches e followed by a combining accent (two separate Unicode characters).

Once you have converted your pattern this way, you can use it in your preg_match.
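The conversion could be sketched as follows. accentInsensitivePattern() is a hypothetical helper written for this illustration, and its letter map is a deliberately small subset that you would need to extend. It assumes the source file is saved as UTF-8 with precomposed letters, and it allows any number of combining marks (\p{M}*) rather than exactly one.

```php
<?php
// Hypothetical helper: turns a search term into an accent-tolerant pattern.
function accentInsensitivePattern(string $term): string
{
    // Illustrative subset only: base letter => its precomposed accented variants.
    $variants = [
        'a' => 'àáâãä', 'c' => 'ç', 'e' => 'èéêë', 'i' => 'ìíîï',
        'n' => 'ñ', 'o' => 'òóôõö', 'u' => 'ùúûü', 'y' => 'ýÿ',
    ];
    $flip = []; // accented letter => base letter
    foreach ($variants as $base => $accented) {
        foreach (preg_split('//u', $accented, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
            $flip[$ch] = $base;
        }
    }
    $pattern = '';
    $chars = preg_split('//u', mb_strtolower($term, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($chars as $ch) {
        $base = $flip[$ch] ?? $ch;
        if (isset($variants[$base])) {
            // a precomposed variant, or the base letter plus optional combining marks
            $pattern .= '(?:[' . $variants[$base] . ']|' . $base . '\p{M}*)';
        } else {
            $pattern .= preg_quote($ch, '/');
        }
    }
    return $pattern;
}

// Sample text with a precomposed é, a plain e, and a decomposed e + U+0301.
$text = "the client name is C\u{00E9}ra but it could be Cera or ce\u{0301}ra too.";
$highlighted = preg_replace(
    '/' . accentInsensitivePattern("c\u{00E9}ra") . '/iu',
    '<span class="highlight">$0</span>',
    $text
);
echo $highlighted, PHP_EOL;
```

Because the replacement uses $0, the highlighted text keeps whatever accents the original string had, whichever spelling the search term used.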

A POSIX equivalence class matches all characters that share the same collation order. Inside a bracket expression it is written like this:

[[=a=]]

Depending on your locale, this matches á and ä as well as a. Note that this only works in regex engines that implement POSIX bracket expressions; PCRE, which backs PHP's preg_* functions, does not support equivalence classes.

As you noted in one of the comments, you don't need a regular expression for this, as the goal is to find specific strings. Why don't you use explode? Like this:

$clientName   = 'céra';
$this->search = 'cera';

$compareClientName = strtolower(iconv('utf-8', 'ascii//TRANSLIT', $clientName));
$this->search      = strtolower($this->search);

$pieces = explode($compareClientName, $this->search);

if (count($pieces) > 1)
{
    $clientName = implode('<span class="highlight">'.$clientName.'</span>', $pieces);
}

Edit:

If your $search variable may contain special characters too, why don't you transliterate it as well and use mb_strpos with $offset? Like this:

$offset = 0;
$highlighted = '';
$len = mb_strlen($compareClientName, 'UTF-8');
// mb_strpos() returns false (not -1) when the needle is not found
while(($pos = mb_strpos($this->search, $compareClientName, $offset, 'UTF-8')) !== false) {
    $highlighted .= mb_substr($this->search, $offset, $pos-$offset, 'UTF-8').
         '<span class="highlight">'.
         mb_substr($this->search, $pos, $len, 'UTF-8').'</span>';
    $offset = $pos + $len;
}
// pass null as the length so that 'UTF-8' is taken as the encoding argument
$highlighted .= mb_substr($this->search, $offset, null, 'UTF-8');
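To keep the original accented characters in the output, one option is to search in an accent-folded copy but cut the pieces out of the original string. The sketch below is only an illustration: foldAccents() and highlightFolded() are hypothetical helpers, the map is a small subset, and it assumes that folding and lowercasing never change the number of characters (true for this map, but not for every Unicode letter).

```php
<?php
// Hypothetical helper: lowercases and folds a few accented letters to ASCII,
// one character for one character, so offsets stay aligned with the original.
function foldAccents(string $s): string
{
    // Illustrative subset only; each key maps to exactly one character.
    static $map = [
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a', 'ç' => 'c',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e', 'ï' => 'i',
        'ñ' => 'n', 'ô' => 'o', 'ö' => 'o', 'ù' => 'u', 'ü' => 'u',
    ];
    return strtr(mb_strtolower($s, 'UTF-8'), $map);
}

// Hypothetical helper: finds $needle in the folded text, highlights the original.
function highlightFolded(string $text, string $needle): string
{
    $foldedText   = foldAccents($text);
    $foldedNeedle = foldAccents($needle);
    $len = mb_strlen($foldedNeedle, 'UTF-8');
    if ($len === 0) {
        return $text; // nothing to search for
    }
    $out = '';
    $offset = 0;
    while (($pos = mb_strpos($foldedText, $foldedNeedle, $offset, 'UTF-8')) !== false) {
        // positions come from the folded copy, substrings from the original
        $out .= mb_substr($text, $offset, $pos - $offset, 'UTF-8')
              . '<span class="highlight">'
              . mb_substr($text, $pos, $len, 'UTF-8')
              . '</span>';
        $offset = $pos + $len;
    }
    return $out . mb_substr($text, $offset, null, 'UTF-8');
}

$result = highlightFolded("C\u{00E9}ra and cera", 'cera');
echo $result, PHP_EOL;
```

Text with decomposed accents (letter plus combining mark) would need to be normalised first, since the extra code points would shift the offsets.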

Update 2:

It is important to use the mb_ functions instead of plain strlen etc., because in UTF-8 accented characters are stored using two or more bytes. Also, always make sure that you use the right encoding. Take a look at this, for example:

echo strlen('é');
> 2

echo mb_strlen('é');
> 2

echo mb_internal_encoding();
> ISO-8859-1

echo mb_strlen('é', 'UTF-8');
> 1

mb_internal_encoding('UTF-8');
echo mb_strlen('é');
> 1
