
Issue with Spanish string encoding

I need help changing the encoding of a string that was copied and pasted from the clipboard...

The curious string is "español":

$problematicString = "español"; //copied and pasted from a filename
$okString          = "español"; //typed

echo md5($problematicString)."<br>";
echo md5($okString)."<br>";

This is the output:

c9ae1d88242473e112ede8df2bdd6802
5d971adb0ba260af6a126a2ade4dd133

Why are the md5() outputs different for the same strings?

I've tried converting both strings with mb_convert_encoding($string, "ISO-8859-1", "UTF-8"), but the outputs are still different.

I need to fix $problematicString programmatically so that it produces the same hash as the other string.

Why are the md5() outputs different for the same strings?

They are not the same string. In the first case the tilde is a separate character that follows the 'n', and it can render as if it sat on the 'o':

$problematicString = "español"

In the second case, the tilde is on the 'n':

$okString = "español";

That's why the hashes don't match.

The reason is that the first string contains a hidden Unicode character:

&#771;

Pulled from my editor:

$problematicString = "espan&#771;ol";

That is what the string actually contains. The hidden character is a combining tilde (decimal 771, U+0303), which is drawn over the preceding letter.
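The difference can also be made visible from PHP itself. The following is a minimal sketch, assuming PHP 7.0 or later (for the "\u{...}" escapes) and UTF-8 source strings; hexBytes() is a hypothetical helper introduced only for this illustration, and the two literals are written with explicit escapes so the distinction survives copy and paste.

```php
<?php
// 'n' followed by U+0303 COMBINING TILDE (the pasted, decomposed form).
$problematicString = "espan\u{0303}ol";
// U+00F1 LATIN SMALL LETTER N WITH TILDE (the typed, precomposed form).
$okString          = "espa\u{00F1}ol";

// Hypothetical helper: list the UTF-8 bytes of a string as hex pairs.
function hexBytes(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

echo hexBytes($problematicString), "\n"; // 65 73 70 61 6e cc 83 6f 6c
echo hexBytes($okString), "\n";          // 65 73 70 61 c3 b1 6f 6c
echo strlen($problematicString), " vs ", strlen($okString), " bytes\n"; // 9 vs 8 bytes
```

The bytes cc 83 after the plain n (6e) are the UTF-8 encoding of the combining tilde, while c3 b1 is the single precomposed ñ. md5() hashes the bytes, not the rendered glyphs, so the two strings hash differently.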

These symbols, which are most of the non-ASCII symbols useful for standard phonetic transcription of English, are drawn from several regions of the Unicode chart: from Latin-1 Supplement, Latin Extended-A and B, IPA Extensions, Combining Diacritical Marks, and Greek (for the theta). All of these pages are supported by Lucida Sans Unicode, a TrueType font that Microsoft has bundled with recent products. Sadly, Bitstream's mother-of-all-TTFs Cyberbit does not support the IPA Extensions. These values can be entered manually as character entities or assigned to hot keys, buttons, or whatever the browser allows. Word97 can access the font via the symbol table under Insert.

Another way to write this font is to use Wincalis uniedit, which will write the Unicode values directly into the file. Then "This is phonetically transcribed" is represented in strange alphabet soup which is converted by the browser into [ðɪs ɪz fɘnɛɾɘkli trænskraibd] (look at this in a plain text editor to see the soup). For any serious or extensive transcription work, an editor like Wincalis would prove handy--you can even customize the IPA keyboard supplied.

If you want the file to trigger Unicode UTF-8 decoding in the browser, you must preface this META tag:

with the following under "Diacritics":

̃ #771 nasalized

As @BeetleJuice said, they are not the same string. Here's another way to understand this: reduce the data to just these two strings:

"español";
"español";

Then run the od command against them. Observe that the bytes differ (od prints 16-bit words with the two bytes swapped here, so 83cc is the byte pair cc 83 and b1c3 is c3 b1):

0000000      6522    7073    6e61    83cc    6c6f    3b22    220a    7365
           "   e   s   p   a   n    ̃  **   o   l   "   ;  \n   "   e   s
0000020      6170    b1c3    6c6f    3b22    0a20
           p   a   ñ  **   o   l   "   ;      \n
0000032

In the first string, the "ñ" is actually two code points: a plain n followed by a combining tilde, U+0303 ( http://www.fileformat.info/info/unicode/char/0303/index.htm ). In the second string it is a single precomposed character, ñ, U+00F1 ( http://www.fileformat.info/info/unicode/char/f1/index.htm ). You can check this in an editor: deleting the first "ñ" with backspace takes two presses, one for the tilde and one for the n.
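Since the two strings differ only in how the ñ is composed, one way to make the hashes match is to normalize to Unicode Normalization Form C before hashing. This is a sketch, not a definitive fix: it assumes PHP 7.0+ and uses Normalizer::normalize() from the intl extension when that extension is loaded. toNfc() is a hypothetical helper name, and its fallback branch is a deliberately narrow assumption that only handles n + U+0303. Note that mb_convert_encoding() does not help here, because it changes the byte encoding of each code point, not how characters are composed.

```php
<?php
// Hypothetical helper: return $s in Unicode Normalization Form C (NFC).
function toNfc(string $s): string
{
    if (class_exists('Normalizer')) {
        // intl extension: composes base letters with combining marks where possible.
        $normalized = Normalizer::normalize($s, Normalizer::FORM_C);
        // normalize() returns false on failure (e.g. invalid UTF-8); keep the input then.
        return $normalized === false ? $s : $normalized;
    }
    // Narrow fallback (assumption: only "n" + U+0303 needs composing).
    return str_replace("n\u{0303}", "\u{00F1}", $s);
}

$problematicString = "espan\u{0303}ol"; // decomposed, as pasted from the filename
$okString          = "espa\u{00F1}ol";  // precomposed, as typed

var_dump(md5($problematicString) === md5($okString));        // bool(false)
var_dump(md5(toNfc($problematicString)) === md5($okString)); // bool(true)
```

Decomposed names like this often come from filenames on file systems that store them in decomposed form, so normalizing any string that originates from a filename before comparing or hashing it is a reasonable default.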
