Regex to match string with and without special/accented characters?
Is there a regular expression to match a specific string with and without special characters? Special-character-insensitive, so to speak.
Like céra will match cera, and vice versa.
Any ideas?
Edit: I want to match specific strings with and without special/accented characters. Not just any string/character.
Test example:
$clientName = 'céra';
$this->search = 'cera';
$compareClientName = strtolower(iconv('utf-8', 'ascii//TRANSLIT', $clientName));
$this->search = strtolower($this->search);
if (strpos($compareClientName, $this->search) !== false)
{
    $clientName = preg_replace('/(.*?)('.$this->search.')(.*?)/iu', '$1<span class="highlight">$2</span>$3', $clientName);
}
Output: <span class="highlight">céra</span>
As you can see, I want to highlight the specific search string. However, I still want to display the original (accented) characters of the matched string.
I'll have to combine this with Michael Sivolobov's answer somehow, I guess.
I think I'll have to work with a separate preg_match() and preg_replace(), right?
You can use the \p{L} pattern to match any letter.
You have to add the u modifier after the regular expression to enable Unicode mode.
Example: /\p{L}+/u
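A minimal sketch of the \p{L} suggestion, assuming the input is UTF-8. The sample sentence and the variable names are made up for illustration.

```php
<?php
// Hypothetical sample text; "\u{00E9}" is é written as an escape (PHP 7+).
$text = "C\u{00E9}ra and Cera, 2 clients";

// \p{L}+ matches runs of letters from any script; the u modifier makes
// PCRE treat both the pattern and the subject as UTF-8.
preg_match_all('/\p{L}+/u', $text, $matches);
$words = $matches[0];

print_r($words);
```

Without the u modifier PCRE processes the subject byte by byte, so a multi-byte character such as é is not seen as a single letter.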
Edit:
Try something like this. It replaces every accented letter in the name with a search pattern that contains the accented letter (both as a single character and as a Unicode two-character sequence) and the unaccented letter. You can then use the resulting search pattern to highlight your text.
function mbStringToArray($string)
{
    $array = []; // ensure an array is returned even for an empty string
    $strlen = mb_strlen($string, "UTF-8");
    while ($strlen)
    {
        $array[] = mb_substr($string, 0, 1, "UTF-8");
        $string = mb_substr($string, 1, $strlen, "UTF-8");
        $strlen = mb_strlen($string, "UTF-8");
    }
    return $array;
}
// I had to use this ugly function to remove accents as iconv didn't work properly on my test server.
// Note: utf8_encode() and utf8_decode() are deprecated as of PHP 8.2.
function stripAccents($stripAccents){
    return utf8_encode(strtr(utf8_decode($stripAccents),utf8_decode('àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ'),'aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY'));
}
$clientName = 'céra';
$clientNameNoAccent = stripAccents($clientName);
$clientNameArray = mbStringToArray($clientName);
foreach($clientNameArray as $pos => &$char)
{
    // As long as stripAccents() removed every accent, the stripped string is ASCII,
    // so byte index $pos lines up with character index $pos.
    $charNA = $clientNameNoAccent[$pos];
    if($char != $charNA)
    {
        $char = "(?:$char|$charNA|$charNA\p{M})";
    }
}
unset($char); // break the reference left over from the foreach
$clientSearchPattern = implode($clientNameArray); // c(?:é|e|e\p{M})ra
$text = 'the client name is Céra but it could be Cera or céra too.';
$search = preg_replace('/(.*?)(' . $clientSearchPattern . ')(.*?)/iu', '$1<span class="highlight">$2</span>$3', $text);
echo $search; // the client name is <span class="highlight">Céra</span> but it could be <span class="highlight">Cera</span> or <span class="highlight">céra</span> too.
If you want to know whether a letter carries an accent or another mark, you can check by matching the pattern \p{M}.
UPDATE
You need to convert every accented letter in your pattern to a group of alternatives:
E.g. céra -> c(?:é|e|e\p{M})ra
Why did I add e\p{M}? Because your letter é can be a single character in Unicode, or a combination of two characters (e followed by a combining acute accent). e\p{M} matches e followed by a combining mark (two separate Unicode characters).
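To make the one-character versus two-character point concrete, here is a small sketch that checks the alternation against three spellings of the same word. The variable names are illustrative, and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
// Three spellings of the same visible word.
$precomposed = "c\u{00E9}ra";   // é as the single code point U+00E9
$decomposed  = "ce\u{0301}ra";  // e followed by U+0301 COMBINING ACUTE ACCENT
$plain       = 'cera';

// The alternation from the answer: é, bare e, or e plus any combining mark.
$pattern = '/^c(?:' . "\u{00E9}" . '|e|e\p{M})ra$/u';

$results = [];
foreach ([$precomposed, $decomposed, $plain] as $candidate) {
    $results[] = preg_match($pattern, $candidate);
}
var_dump($results);
```

Dropping the e\p{M} branch would make the decomposed spelling fail, because the combining mark would be left over between e and r.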
Once you have converted your pattern to match all of these variants, you can use it in your preg_match.
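The conversion can be sketched as a small helper. The function name accentInsensitivePattern() and the tiny $map are hypothetical and exist only for this illustration; a real map would need to cover many more letters.

```php
<?php
// Hypothetical helper: turns a name into a regex fragment in which every
// accented letter listed in $map also matches its unaccented form.
function accentInsensitivePattern(string $name, array $map): string
{
    // Split into UTF-8 characters (not bytes).
    $chars = preg_split('//u', $name, -1, PREG_SPLIT_NO_EMPTY);
    $out = '';
    foreach ($chars as $char) {
        if (isset($map[$char])) {
            $base = $map[$char];
            // accented letter | base letter | base letter + combining mark
            $out .= '(?:' . $char . '|' . $base . '|' . $base . '\p{M})';
        } else {
            // Escape anything that has a special meaning in a regex.
            $out .= preg_quote($char, '/');
        }
    }
    return $out;
}

// Illustrative, deliberately tiny map.
$map = ["\u{00E9}" => 'e', "\u{00E8}" => 'e', "\u{00E0}" => 'a'];

$fragment = accentInsensitivePattern("c\u{00E9}ra", $map);
$text = "the client name is C\u{00E9}ra but it could be Cera too.";
$highlighted = preg_replace(
    '/' . $fragment . '/iu',
    '<span class="highlight">$0</span>',
    $text
);
echo $highlighted, "\n";
```

This sketch only widens accented letters that appear in the name. If the name itself is stored unaccented (cera) and the text contains céra, the unaccented letters would need alternatives too.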
As you noted in one of the comments, you don't need a regular expression for that, as the goal is to find specific strings. Why don't you use explode? Like this:
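$clientName = 'céra';
$this->search = 'cera';
// Transliterate to ASCII so that 'céra' can compare equal to 'cera'.
$compareClientName = strtolower(iconv('utf-8', 'ascii//TRANSLIT', $clientName));
$this->search = strtolower($this->search);
// Note: the delimiter is the client name, so this only matches when the search text contains the whole name.
$pieces = explode($compareClientName, $this->search);
if (count($pieces) > 1)
{
    // Re-join the pieces around the original (accented) name.
    $clientName = implode('<span class="highlight">'.$clientName.'</span>', $pieces);
}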
Edit:
If your $search variable may contain special characters too, why don't you transliterate it as well and use mb_strpos with $offset? Like this:
$offset = 0;
$highlighted = '';
$len = mb_strlen($compareClientName, 'UTF-8');
// mb_strpos() returns false, not -1, when there is no further match.
while(($pos = mb_strpos($this->search, $compareClientName, $offset, 'UTF-8')) !== false) {
    $highlighted .= mb_substr($this->search, $offset, $pos-$offset, 'UTF-8').
        '<span class="highlight">'.
        mb_substr($this->search, $pos, $len, 'UTF-8').'</span>';
    $offset = $pos + $len;
}
// Pass null as the length so that 'UTF-8' is taken as the encoding argument.
$highlighted .= mb_substr($this->search, $offset, null, 'UTF-8');
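One way to keep the original accented characters in the output is to search in an accent-folded copy and cut the pieces out of the original text. The sketch below assumes a one-to-one folding map, so character positions stay aligned, and text in precomposed form. The helpers foldAccents() and highlightAccentInsensitive() are hypothetical names, and the map is intentionally tiny.

```php
<?php
// Hypothetical helper: folds a few accented letters to ASCII,
// one character in, one character out, so positions stay aligned.
function foldAccents(string $s): string
{
    // Illustrative map only; a real one would cover far more letters.
    return strtr($s, [
        "\u{00E9}" => 'e', "\u{00E8}" => 'e', "\u{00EA}" => 'e',
        "\u{00E0}" => 'a', "\u{00E7}" => 'c',
        "\u{00C9}" => 'E', "\u{00C0}" => 'A', "\u{00C7}" => 'C',
    ]);
}

function highlightAccentInsensitive(string $text, string $search): string
{
    $needle = foldAccents($search);
    $len = mb_strlen($needle, 'UTF-8');
    if ($len === 0) {
        return $text; // nothing to look for
    }
    $folded = foldAccents($text);
    $out = '';
    $offset = 0;
    // mb_stripos() returns false (not -1) when nothing more is found.
    while (($pos = mb_stripos($folded, $needle, $offset, 'UTF-8')) !== false) {
        // Copy from the ORIGINAL text so the accents are preserved.
        $out .= mb_substr($text, $offset, $pos - $offset, 'UTF-8')
            . '<span class="highlight">'
            . mb_substr($text, $pos, $len, 'UTF-8')
            . '</span>';
        $offset = $pos + $len;
    }
    return $out . mb_substr($text, $offset, null, 'UTF-8');
}

$result = highlightAccentInsensitive("C\u{00E9}ra or cera.", 'cera');
echo $result, "\n";
```

Because the search runs on the folded copy while the output is sliced from the original, this works in both directions: searching for cera finds Céra, and searching for céra finds cera.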
Update 2:
It is important to use the mb_ functions instead of plain strlen etc. This is because in UTF-8 accented characters are stored using two or more bytes. Also, always make sure that you use the right encoding. Take a look at this, for example:
echo strlen('é');
> 2
echo mb_strlen('é');
> 2
echo mb_internal_encoding();
> ISO-8859-1
echo mb_strlen('é', 'UTF-8');
> 1
mb_internal_encoding('UTF-8');
echo mb_strlen('é');
> 1
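The same byte-versus-character difference affects slicing, not just counting. A short sketch, assuming UTF-8 input (the variable names are illustrative):

```php
<?php
$word = "c\u{00E9}ra"; // é occupies two bytes in UTF-8

$bytes = strlen($word);              // counts bytes
$chars = mb_strlen($word, 'UTF-8');  // counts characters

// substr() cuts on a byte boundary and can split é in half;
// mb_substr() cuts on a character boundary.
$byteCut = substr($word, 0, 2);
$charCut = mb_substr($word, 0, 2, 'UTF-8');

var_dump($bytes, $chars, mb_check_encoding($byteCut, 'UTF-8'), $charCut);
```

Passing the encoding explicitly, as above, avoids depending on whatever mb_internal_encoding() happens to be set to.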