Problem and original data
I have JSON data that uses HTML entities to encode some special characters (mostly from French, like “é”, “ç”, “à”, etc.) as well as HTML tags. Here is a sample of my JSON data:
{
"data1": "&lt;p&gt;Le cartulaire de 1380-1381 copi&amp;eacute; au XVIIe si&amp;egrave;cle et aujourd&amp;rsquo;hui perdu&lt;strong&gt;*&lt;/strong&gt;.",
"data2": "&lt;p&gt;&lt;strong&gt;*&lt;/strong&gt; Joseph CUVELIER, &lt;em&gt;Cartulaire de l&amp;rsquo;abbaye du Val-Beno&amp;icirc;t&lt;/em&gt;, Bruxelles, 1906, p. XI-XXVII.&lt;/p&gt;"
}
Desired result
{
"data1": "<p>Le cartulaire de 1380-1381 copié au XVIIe siècle et aujourd’hui perdu<strong>*</strong>.",
"data2": "<p><strong>*</strong> Joseph CUVELIER, <em>Cartulaire de l’abbaye du Val-Benoît</em>, Bruxelles, 1906, p. XI-XXVII.</p>"
}
So, I simply want to decode all HTML entities back to their respective characters and tags. I am trying to do this with PHP.
Here is my current code:
/* decode data */
$jsonData = '{
"data1": "&lt;p&gt;Le cartulaire de 1380-1381 copi&amp;eacute; au XVIIe si&amp;egrave;cle et aujourd&amp;rsquo;hui perdu&lt;strong&gt;*&lt;/strong&gt;.",
"data2": "&lt;p&gt;&lt;strong&gt;*&lt;/strong&gt; Joseph CUVELIER, &lt;em&gt;Cartulaire de l&amp;rsquo;abbaye du Val-Beno&amp;icirc;t&lt;/em&gt;, Bruxelles, 1906, p. XI-XXVII.&lt;/p&gt;"
}';
$data = json_decode($jsonData, true);
/* change html entities and re-encode data */
$data = mb_convert_encoding($data, "UTF-8", "HTML-ENTITIES");
header('Content-Type: application/json; Charset="UTF-8"');
echo json_encode($data, JSON_UNESCAPED_UNICODE|JSON_UNESCAPED_SLASHES);
My current result:
{
"data1": "<p>Le cartulaire de 1380-1381 copi&eacute; au XVIIe si&egrave;cle et aujourd&rsquo;hui perdu<strong>*</strong>.",
"data2": "<p><strong>*</strong> Joseph CUVELIER, <em>Cartulaire de l&rsquo;abbaye du Val-Beno&icirc;t</em>, Bruxelles, 1906, p. XI-XXVII.</p>"
}
So, the HTML tags were converted correctly. But the HTML entities for the French special characters are still there (for example, instead of &amp;eacute; I now have &eacute;).
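Entities that survive one decoding pass usually mean the text was entity-encoded twice. Here is a minimal sketch of that behaviour, assuming a hypothetical sample value in which the ampersand of each entity was itself escaped; the variable names are illustrative only and the flags passed to html_entity_decode are one reasonable choice, not a requirement.

```php
<?php
// Hypothetical value, entity-encoded twice (an assumption for illustration).
$value = 'copi&amp;eacute; au XVIIe si&amp;egrave;cle';

// First pass: only the outer layer is decoded (&amp; becomes &).
$once = html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Second pass: the remaining named entities become UTF-8 characters.
$twice = html_entity_decode($once, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $once, PHP_EOL;  // copi&eacute; au XVIIe si&egrave;cle
echo $twice, PHP_EOL; // copié au XVIIe siècle
```

A single call never rescans its own output, which is why one pass leaves named entities behind when the input was double-encoded.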
Question: how can I convert HTML entities back to characters?
You can test it online here: https://www.tehplayground.com/Z4uB5KIPPo4UQ4h1
Many thanks in advance!
UPDATE:
Finally, my data is more complex than I imagined. Within the same data, some characters were preserved as “é”, “à”, “ç”, etc., while others were converted to HTML entities. So I can have something like this:
{
"someData1":
{
"data1":
[
"ecclésiastique"
],
"data2": "s&eacute;culiers"
},
"someData2":
[
{
"anotherData1": "ecclésiastique",
"anotherData2": "<p>Le cartulaire de 1380-1381 copi&eacute; au XVIIe si&egrave;cle et aujourd&rsquo;hui perdu<strong>*</strong>.",
"anotherData3":
{
"text1": "texte here",
"text2": "texte here"
}
},
{
"anotherData1": "ecclésiastique",
"anotherData2": "<p>Le cartulaire de 1380-1381 copi&eacute; au XVIIe si&egrave;cle et aujourd&rsquo;hui perdu<strong>*</strong>.",
"anotherData3":
{
"text1": "texte here",
"text2": "texte here"
}
}
]
}
So, I suppose I have to 1) convert all data to HTML entities; 2) convert all HTML entities back to characters…
Here is my current code:
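The first of these two steps may not be needed: html_entity_decode leaves characters that are already plain UTF-8 untouched, so mixed input can be decoded directly. A small sketch, assuming UTF-8 input and using two values taken from the sample above; the variable names are hypothetical.

```php
<?php
// One value already in UTF-8, one still entity-encoded.
$mixed = ['ecclésiastique', 's&eacute;culiers'];

// Decode every string; plain UTF-8 characters pass through unchanged.
$decoded = array_map(
    static function ($s) {
        return html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    },
    $mixed
);

print_r($decoded); // [0] => ecclésiastique, [1] => séculiers
```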
# Get data
$jsonData = '{
"someData1":
{
"data1":
[
"ecclésiastique"
],
"data2": "s&eacute;culiers"
},
"someData2":
[
{
"anotherData1": "ecclésiastique",
"anotherData2": "<p>Le cartulaire de 1380-1381 copi&eacute; au XVIIe si&egrave;cle et aujourd&rsquo;hui perdu<strong>*</strong>.",
"anotherData3":
{
"text1": "texte here",
"text2": "texte here"
}
},
{
"anotherData1": "ecclésiastique",
"anotherData2": "<p>Le cartulaire de 1380-1381 copi&eacute; au XVIIe si&egrave;cle et aujourd&rsquo;hui perdu<strong>*</strong>.",
"anotherData3":
{
"text1": "texte here",
"text2": "texte here"
}
}
]
}';
$data = json_decode($jsonData, true);
# Convert character encoding
$data = mb_convert_encoding($data, "UTF-8", "HTML-ENTITIES");
# Convert HTML entities to their corresponding characters
function html_decode(&$item){
$item = html_entity_decode($item);
}
array_walk_recursive($data, 'html_decode');
var_dump ($data);
So, I succeeded in reversing the encoding: what used to be HTML entities became special characters, and what used to be special characters became HTML entities.
But I have no idea how to end up with only special characters.
Online test: https://www.tehplayground.com/bVo3Jr5O7L9p4MXX
Here is the solution. I needed to replace &amp; with & to standardize the encoding.
Here is the final code. Many thanks to all for your comments and suggestions.
# Get data
$jsonData = '{
"someData1":
{
"data1":
[
"ecclésiastique"
],
"data2": "s&eacute;culiers"
},
"someData2":
[
{
"anotherData1": "ecclésiastique",
"anotherData2": "<p>Le cartulaire de 1380-1381 copi&eacute; au XVIIe si&egrave;cle et aujourd&rsquo;hui perdu<strong>*</strong>.",
"anotherData3":
{
"text1": "texte here",
"text2": "texte here"
}
},
{
"anotherData1": "ecclésiastique",
"anotherData2": "<p>Le cartulaire de 1380-1381 copi&eacute; au XVIIe si&egrave;cle et aujourd&rsquo;hui perdu<strong>*</strong>.",
"anotherData3":
{
"text1": "texte here",
"text2": "texte here"
}
}
]
}';
$data = json_decode($jsonData, true);
# Replace &amp; by & so that double-encoded entities become ordinary entities
array_walk_recursive($data, function(&$item, $key) {
if(is_string($item)) {
$item = str_replace("&amp;", "&", $item);
}
});
# Convert HTML entities to their corresponding characters
array_walk_recursive($data, function(&$item, $key) {
if(is_string($item)) {
$item = html_entity_decode($item);
}
});
var_dump ($data);
Online test: https://www.tehplayground.com/ms1KxR2tywOxIS9J
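The two passes above can also be folded into one recursive helper. This is only a sketch under stated assumptions: decode_entities_deep is a hypothetical name, the limit of three passes is arbitrary, and the sample JSON is made up for illustration. Instead of replacing &amp; by hand, it decodes each string repeatedly until nothing changes.

```php
<?php
// Hypothetical helper: decode entities in every string of a nested array.
function decode_entities_deep($value)
{
    if (is_array($value)) {
        // array_map with a single array preserves the original keys.
        return array_map('decode_entities_deep', $value);
    }
    if (!is_string($value)) {
        return $value; // numbers, booleans and null are left alone
    }
    // Decode until stable, at most 3 passes (arbitrary safety limit).
    for ($i = 0; $i < 3; $i++) {
        $decoded = html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        if ($decoded === $value) {
            break;
        }
        $value = $decoded;
    }
    return $value;
}

// Illustrative input mixing plain UTF-8 and a double-encoded entity.
$json = '{"a": ["ecclésiastique"], "b": {"c": "s&amp;eacute;culiers", "d": 3}}';
$result = json_encode(
    decode_entities_deep(json_decode($json, true)),
    JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES
);
echo $result, PHP_EOL; // {"a":["ecclésiastique"],"b":{"c":"séculiers","d":3}}
```

Repeated decoding is only appropriate when the data never needs to keep a literal entity (for example, text that is meant to display &lt; as written); in that case a single targeted pass is the safer choice.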