
How to remove the embedded formatting of an UTF-8 string?

I'm querying the Facebook API in PHP to get a list of posts and display it on a website.

// $facebook is an instance of Facebook\Facebook
$response = $facebook->get('posts?fields=id,message,created_time,full_picture,permalink_url,status_type&limit=20');
$graphEdge = $response->getGraphEdge();
$posts = [];

foreach ($graphEdge as $post) {
    $message = $post->getField('message');
}

The text returned by the call looks like the picture below:

[Image: enter image description here]

My problem is that sometimes the formatting of the text seems to be embedded in the characters themselves. For example, the text "Montélimar - aux Portes du Soleil" uses a different font than the one defined in CSS, and I can't force it to use a different style. The HTML looks like this:

<p>
  Profitez d’un cadre de vie idéal pour faire construire votre maison individuelle sur la commune de 𝐌𝐨𝐧𝐭𝐞́𝐥𝐢𝐦𝐚𝐫 - 𝐚𝐮𝐱 𝐏𝐨𝐫𝐭𝐞𝐬 𝐝𝐮 𝐒𝐨𝐥𝐞𝐢𝐥 ☀️
  Notre lotissement « 𝐋𝐞 𝐃𝐨𝐦𝐚𝐢𝐧𝐞 𝐝𝐞 𝐆𝐞́𝐫𝐲 » ...
</p>

We even store the data in a JSON object and it looks like this (see the "description" field):

[
    {
        "pageName": "---",
        "type": "---",
        "date": "---",
        "description": "Profitez d’un cadre de vie idéal pour faire construire votre maison individuelle sur la commune de 𝐌𝐨𝐧𝐭𝐞́𝐥𝐢𝐦𝐚𝐫 - 𝐚𝐮𝐱 𝐏𝐨𝐫𝐭𝐞𝐬 𝐝𝐮 𝐒𝐨𝐥𝐞𝐢𝐥 ☀️ Notre lotissement « 𝐋𝐞 𝐃𝐨𝐦𝐚𝐢𝐧𝐞 𝐝𝐞 𝐆𝐞́𝐫𝐲 » ...",
        "time": 0000,
        "thumbnail": "---",
        "url": "---",
        "img": "---"
    }
]

As you can see, some text has a default styling that I can't figure out how to get rid of. I've tried re-encoding the text to UTF-8 in PHP using mb_convert_encoding(), but this doesn't solve the problem because the string is already UTF-8.

How can I remove this formatting? Is this even formatting, or just special UTF-8 symbols?

If you copy one of the characters (for example, the "M" of "Montélimar") and look it up in the Unicode Character Table (https://unicode-table.com/en/1D40C/), you will find that it is not the ordinary letter "M" but a "Mathematical Bold Capital M", identified as follows:

  • Unicode number: U+1D40C
  • HTML-code: &#119820;

So this is a problem with your content itself and not an encoding problem. The encoding is fine, and I don't think you can do anything to fix this appearance issue.
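To check whether a given message contains such characters, one option is to scan it for code points in the Mathematical Alphanumeric Symbols block (U+1D400 to U+1D7FF). The following is a minimal sketch, assuming the mbstring extension and PHP 7.2 or later; the function name findMathAlphanumerics is hypothetical and not part of any library.

```php
<?php
// Hypothetical helper: list every character of $text that lies in the
// Mathematical Alphanumeric Symbols block (U+1D400..U+1D7FF), keyed by the
// character itself, with its code point in U+XXXX notation as the value.
function findMathAlphanumerics(string $text): array
{
    // The /u flag makes PCRE treat both pattern and subject as UTF-8.
    preg_match_all('/[\x{1D400}-\x{1D7FF}]/u', $text, $matches);

    $found = [];
    foreach ($matches[0] as $char) {
        $found[$char] = sprintf('U+%04X', mb_ord($char, 'UTF-8'));
    }
    return $found;
}

// "\u{...}" escapes are used here so the sample does not depend on the
// editor's font: these are the bold forms of "M", "o" and "n".
$sample = "commune de \u{1D40C}\u{1D428}\u{1D427}";
$found  = findMathAlphanumerics($sample);

print_r(array_values($found)); // U+1D40C, U+1D428, U+1D427
```

Ordinary accented letters and punctuation are not reported, so an empty result suggests the text can be styled with CSS as usual.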

If the special UTF-8 characters get in the way, you can try converting the string to ASCII with iconv. However, there is a risk that individual characters and, in some cases, important information will be lost.

$strUTF8mb4 = "Profitez d’un cadre de vie idéal pour faire construire votre maison individuelle sur la commune de 𝐌𝐨𝐧𝐭𝐞́𝐥𝐢𝐦𝐚𝐫 - 𝐚𝐮𝐱 𝐏𝐨𝐫𝐭𝐞𝐬 𝐝𝐮 𝐒𝐨𝐥𝐞𝐢𝐥 ☀️ Notre lotissement « 𝐋𝐞 𝐃𝐨𝐦𝐚𝐢𝐧𝐞 𝐝𝐞 𝐆𝐞́𝐫𝐲 » ...";
$strASCII = iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", $strUTF8mb4);
//string(181) "Profitez d'un cadre de vie id'eal pour faire construire votre maison individuelle sur la commune de Montelimar - aux Portes du Soleil Notre lotissement << Le Domaine de Gery >> ..."

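For French text in particular, a round trip through ISO-8859-15 could produce slightly better results, because accented letters such as "é" and the guillemets « » exist in that character set and are kept: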

$strIso = iconv("UTF-8", "ISO-8859-15//TRANSLIT//IGNORE", $strUTF8mb4);
$strUtf8 = iconv("ISO-8859-15", "UTF-8", $strIso);
//"Profitez d'un cadre de vie idéal pour faire construire votre maison individuelle sur la commune de Montelimar - aux Portes du Soleil Notre lotissement « Le Domaine de Gery » ..."
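If only the bold letters need to change and everything else (accents, emoji, typographic quotes) should stay untouched, a narrower option is to map the bold code points back to plain letters directly. This is a minimal sketch under stated assumptions: it needs mbstring, it only covers the Mathematical Bold Latin letters (U+1D400 to U+1D433) seen in the sample text, and the function name unboldMathLetters is hypothetical.

```php
<?php
// Hypothetical helper: replace Mathematical Bold Latin letters
// (U+1D400..U+1D419 = A-Z, U+1D41A..U+1D433 = a-z) with plain ASCII letters.
// All other characters are left exactly as they are.
function unboldMathLetters(string $text): string
{
    $result = preg_replace_callback(
        '/[\x{1D400}-\x{1D433}]/u',
        static function (array $m): string {
            $offset = mb_ord($m[0], 'UTF-8') - 0x1D400;
            // Offsets 0..25 are capitals, 26..51 are small letters.
            return $offset < 26
                ? chr(ord('A') + $offset)
                : chr(ord('a') + $offset - 26);
        },
        $text
    );

    // preg_replace_callback() returns null on error (e.g. invalid UTF-8);
    // in that case fall back to the unmodified input.
    return $result ?? $text;
}

// Bold "Mont" written with escapes, followed by ordinary text.
echo unboldMathLetters("\u{1D40C}\u{1D428}\u{1D427}\u{1D42D} « idéal »");
// prints: Mont « idéal »
```

In the sample post the accent of "𝐞́" is a separate combining character (U+0301), so after this mapping it follows a plain "e" and should render as "é". If the intl extension is available, Normalizer::normalize($text, Normalizer::FORM_KC) is a broader alternative that also handles the other mathematical styles and composes such accents, at the cost of rewriting other compatibility characters as well.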


 