Hope this is a better...
I set "mysqli_set_charset($conn,"utf8");" before running a standard SELECT-FROM-WHERE query. That query performs an accent-insensitive comparison.
In str_ireplace(search, replace, text), the search does an accent-sensitive comparison. I need the search to be accent-insensitive instead.
I want to highlight the word "Français". I replace "Français" by
<mark>Français</mark>
but at the same time I want to replace "Francais" by
<mark>Francais</mark>
older post:
I use a simple way to highlight some text:
$markReplace = "<mark>" . $wordToSearch . "</mark>";
$fullText = str_ireplace($wordToSearch, $markReplace, $fullText);
echo $fullText;
It works fine. The problem is that, because of typos, the same $wordToSearch sometimes appears with an accent and sometimes without: for example "huître"/"huitre", "Français"/"Francais", "écho"/"echo". And unlike my MySQL query, str_ireplace doesn't treat an accented letter as equal to the same letter without the accent.
$unwanted_array = array('Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c',
'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y' );
$str = strtr( $str, $unwanted_array );
A solution that would use something like this doesn't work because it will change all the accents in $fullText. I need to keep the original words when I echo $fullText.
Can't figure out the solution.
Thanks... Andy
Okay so first of all, converting one set of characters [eg: accented] to an equivalent form [eg: unaccented] according to some rules is called "transliteration".
The intl extension provides a handy transliterator class that we can invoke with simply:
$translit = Transliterator::create('Latin-ASCII;');
$foo = $translit->transliterate('Français'); // Francais
So painstakingly maintaining a list of "unwanted" characters and their replacements is not necessary.
Secondly, accented characters are not always single codepoints: ç may be represented by either the unified codepoint, or a two-codepoint sequence consisting of a plain c and a combining mark representing the accent.
The unit comprising a single visual glyph is referred to as a Grapheme.
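To make the codepoint/grapheme distinction concrete, here is a small sketch (assuming the intl and mbstring extensions are loaded; the variable names are just for illustration) that builds ç both ways and compares byte, codepoint, and grapheme counts:

```php
<?php
// Two encodings of the same visible character "ç".
$composed   = "\u{00E7}";   // single code point U+00E7
$decomposed = "c\u{0327}";  // plain "c" + U+0327 COMBINING CEDILLA

// They look identical but are different byte strings.
var_dump($composed === $decomposed);          // bool(false)

// Byte length differs: 2 bytes vs 3 bytes in UTF-8.
var_dump(strlen($composed), strlen($decomposed));

// Code point count differs: 1 vs 2.
var_dump(mb_strlen($composed, 'UTF-8'), mb_strlen($decomposed, 'UTF-8'));

// Grapheme count is the same: both are one visual unit.
var_dump(grapheme_strlen($composed), grapheme_strlen($decomposed));

// Normalizing to NFC folds the two-codepoint form into the single one.
$nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($nfc === $composed);                 // bool(true)
```

This is why byte-wise or codepoint-wise functions are not enough here: the search has to step through the text one grapheme at a time.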
Thirdly, your requirements [case-insensitive and accent-insensitive] essentially mean that we have to build our own custom string-matching procedure.
First, we need a GraphemeIterator to traverse the UTF-8 string properly. intl's IntlBreakIterator::createCharacterInstance() does the heavy lifting, but it returns byte offsets, so let's wrap it in another iterator that actually pops out graphemes:
class GraphemeIterator implements \Iterator {
protected $i, $string, $offset;
public function __construct($string) {
$this->string = $string;
$i = IntlBreakIterator::createCharacterInstance();
$i->setText($string);
$this->i = $i->getIterator();
$this->init();
}
protected function init() {
$this->offset = $this->i->current();
$this->i->next();
}
public function length() {
return grapheme_strlen($this->string);
}
public function tell() {
return [ $this->offset, $this->i->current()];
}
// Iterator Interface functions
public function current(): mixed {
return substr($this->string, $this->offset, $this->i->current() - $this->offset);
}
public function key(): mixed {
return $this->i->key();
}
public function next(): void {
$this->offset = $this->i->current();
$this->i->next();
}
public function rewind(): void {
$this->i->rewind();
$this->init();
}
public function valid(): bool {
return $this->i->valid();
}
}
Now we need something that can compare two strings after applying some arbitrary transformations to both:
class TransformingComparator {
protected $transforms = [];
public function __construct(array $transforms) {
foreach($transforms as $transform) {
$this->addTransform($transform);
}
}
protected function addTransform(callable $transform) {
$this->transforms[] = $transform;
}
protected function transform($input) {
$output = $input;
foreach($this->transforms as $transform) {
$output = $transform($output);
}
return $output;
}
public function compare($a, $b) {
return $this->transform($a) <=> $this->transform($b);
}
}
and a function that can use those to locate the occurrences of the search string:
function findAllInGraphemeString($needle, $haystack, $comparator) {
$t_it = new GraphemeIterator($haystack);
$s_it = new GraphemeIterator($needle);
$s = 0;
$sl = $s_it->length();
$out = [];
$cur = [];
for( $t=0, $tl=$t_it->length(); $t<$tl; ++$t ) {
if( $comparator($t_it->current(), $s_it->current()) === 0 ) {
if( empty($cur) ) {
$cur[] = $t_it->tell()[0];
}
if( ++$s >= $sl ) {
$cur[] = $t_it->tell()[1];
$out[] = $cur;
$cur = [];
$s = 0;
$s_it->rewind();
} else {
$s_it->next();
}
$t_it->next();
} else {
// on aborted partial match restart from current
if( count($cur) != 0 ) {
$s = 0;
$cur=[];
--$t;
} else {
$t_it->next();
}
$s_it->rewind();
}
}
return $out;
}
and finally a function that can perform the actual transformation:
function transformSubstrings(string $text, array $boundaries, callable $transform) {
$output = '';
$offset = 0;
foreach($boundaries as $bound) {
$output .= substr($text, $offset, $bound[0]-$offset);
$output .= $transform(substr($text, $bound[0], $bound[1]-$bound[0]));
$offset = $bound[1];
}
// append the remainder after the last match; using $offset also works when $boundaries is empty
return $output . substr($text, $offset);
}
We can finally put this together as:
$translit = Transliterator::create('Latin-ASCII;');
$transforms = [
[$translit, 'transliterate'], // remove accents
'mb_strtolower'
];
$tc = new TransformingComparator($transforms);
$text = 'lorem ipsum frFrançais dolor sit français amet adsplicing dit';
$search = 'Francais';
echo transformSubstrings(
$text,
findAllInGraphemeString($search, $text, [$tc, 'compare']),
function($a){
return sprintf('<mark>%s</mark>', $a);
}
);
Output:
lorem ipsum fr<mark>Français</mark> dolor sit <mark>français</mark> amet adsplicing dit
and yes, I got nerd sniped hard on this one.
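One caveat with findAllInGraphemeString: when a partial match is aborted it restarts at the current grapheme rather than one position after where the failed attempt began, so a needle with a repeated prefix can be missed (searching "aab" in "aaab" finds nothing). Below is a minimal sketch of a naive sliding-window alternative. The name findAllNaive is hypothetical, and it assumes PCRE's \X (one extended grapheme cluster) is an acceptable stand-in for the break iterator. It returns the same [start, end] byte-offset pairs, so it can be passed to transformSubstrings unchanged.

```php
<?php
// Hypothetical alternative to findAllInGraphemeString (not the original code).
// Returns a list of [startByte, endByte] pairs for non-overlapping matches.
function findAllNaive(string $needle, string $haystack, callable $compare): array {
    // \X matches one grapheme cluster; PREG_OFFSET_CAPTURE adds byte offsets.
    preg_match_all('/\X/u', $haystack, $h, PREG_OFFSET_CAPTURE);
    preg_match_all('/\X/u', $needle, $n);
    $h = $h[0];            // list of [grapheme, byteOffset]
    $n = $n[0];            // list of graphemes
    $hl = count($h);
    $nl = count($n);
    $out = [];
    if ($nl === 0) {
        return $out;       // an empty needle never matches
    }
    $i = 0;
    while ($i + $nl <= $hl) {
        // Try to match the whole needle starting at haystack grapheme $i.
        $j = 0;
        while ($j < $nl && $compare($h[$i + $j][0], $n[$j]) === 0) {
            $j++;
        }
        if ($j === $nl) {
            $last = $h[$i + $nl - 1];
            $out[] = [$h[$i][1], $last[1] + strlen($last[0])];
            $i += $nl;     // skip past the match
        } else {
            $i++;          // slide the window by one grapheme
        }
    }
    return $out;
}

// Accent- and case-insensitive comparator, assuming the intl extension.
$col = new Collator('fr');
$col->setStrength(Collator::PRIMARY);

$hits = findAllNaive('aab', 'aaab', [$col, 'compare']);
print_r($hits); // one match covering bytes 1 to 4
```

This trades speed for simplicity: the worst case is needle length times haystack length comparisons, which should be fine for highlighting short search terms.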
Edit: Now that you've mentioned collations, it occurs to me that intl has a Collator class, and it looks like TransformingComparator is no longer relevant and can be swapped out like so:
$col = new Collator('fr-ca'); // or whatever locale you're using
$col->setStrength(Collator::PRIMARY);
// ...
transformSubstrings(
$text,
findAllInGraphemeString($search, $text, [$col, 'compare']),
function($a){
return sprintf('<mark>%s</mark>', $a);
}
);
Which will likely also be a fair bit faster, since it's likely using a lookup instead of running all the transforms.
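As a quick sanity check of what primary strength does, here is a small sketch (assuming the intl extension; the 'fr' locale and the sample words are just for illustration) comparing a few pairs directly:

```php
<?php
$col = new Collator('fr');
// PRIMARY strength compares base letters only, ignoring accents and case.
$col->setStrength(Collator::PRIMARY);

$sameWord  = $col->compare('Français', 'francais'); // 0: treated as equal
$sameWord2 = $col->compare('huître', 'HUITRE');     // 0: treated as equal
$different = $col->compare('Français', 'francs');   // non-zero: base letters differ

var_dump($sameWord === 0, $sameWord2 === 0, $different === 0);
```

Because compare() returns 0 for these pairs, [$col, 'compare'] can be dropped in wherever the comparator callable is expected.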