
Convert HTML entities in JSON back to characters

Problem and original data

I have JSON data that contains HTML entities, which encode some special characters (mostly French ones, such as “é”, “ç”, “à”) as well as HTML tags. Here is a sample of my JSON data:

{
    "data1": "&lt;p&gt;Le cartulaire de 1380-1381 copi&amp;eacute; au XVIIe si&amp;egrave;cle et aujourd&amp;rsquo;hui perdu&lt;strong&gt;*&lt;/strong&gt;.",
    "data2": "&lt;p&gt;&lt;strong&gt;*&lt;/strong&gt; Joseph CUVELIER, &lt;em&gt;Cartulaire de l&amp;rsquo;abbaye du Val-Beno&amp;icirc;t&lt;/em&gt;, Bruxelles, 1906, p. XI-XXVII.&lt;/p&gt;"
}

Desired result

{
    "data1": "<p>Le cartulaire de 1380-1381 copié au XVIIe siècle et aujourd’hui perdu<strong>*</strong>.",
    "data2": "<p><strong>*</strong> Joseph CUVELIER, <em>Cartulaire de l’abbaye du Val-Benoît</em>, Bruxelles, 1906, p. XI-XXVII.</p>"
}

So I simply want to decode all HTML entities back to their respective characters and tags. I am trying to do this in PHP.

Here is my current code:

/* decode data */

$jsonData = '{
        "data1": "&lt;p&gt;Le cartulaire de 1380-1381 copi&amp;eacute; au XVIIe si&amp;egrave;cle et aujourd&amp;rsquo;hui perdu&lt;strong&gt;*&lt;/strong&gt;.",
        "data2": "&lt;p&gt;&lt;strong&gt;*&lt;/strong&gt; Joseph CUVELIER, &lt;em&gt;Cartulaire de l&amp;rsquo;abbaye du Val-Beno&amp;icirc;t&lt;/em&gt;, Bruxelles, 1906, p. XI-XXVII.&lt;/p&gt;"
    }';
$data = json_decode($jsonData, true);

/* change html entities and re-encode data */

$data = mb_convert_encoding($data, "UTF-8", "HTML-ENTITIES");
header('Content-Type: application/json; Charset="UTF-8"');
echo json_encode($data, JSON_UNESCAPED_UNICODE|JSON_UNESCAPED_SLASHES);

My current result:

{
    "data1": "<p>Le cartulaire de 1380-1381 copi&eacute; au XVIIe si&egrave;cle et aujourd&rsquo;hui perdu<strong>*</strong>.",
    "data2": "<p><strong>*</strong> Joseph CUVELIER, <em>Cartulaire de l&rsquo;abbaye du Val-Beno&icirc;t</em>, Bruxelles, 1906, p. XI-XXVII.</p>"
}

So the HTML tags were decoded correctly, but the entities for the French special characters are still there: where I had, for example, &amp;eacute;, I now have &eacute; (only the leading &amp; was decoded).
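The leftover &eacute; suggests that the text was entity-encoded twice: é became &eacute;, and its & was then escaped again as &amp;. A minimal sketch, assuming UTF-8 strings and using html_entity_decode with explicit flags, of how each decoding pass removes only one layer:

```php
<?php
// Sample string with double-encoded entities, taken from the data above.
$encoded = 'copi&amp;eacute; au XVIIe si&amp;egrave;cle';

// First pass: only the outer layer is removed (&amp; becomes &).
$once = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Second pass: the named entities themselves become characters.
$twice = html_entity_decode($once, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $once, PHP_EOL;   // copi&eacute; au XVIIe si&egrave;cle
echo $twice, PHP_EOL;  // copié au XVIIe siècle
```

This is why a single conversion still leaves entities behind in the result above.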

Question: how can I convert these HTML entities back to characters?

You can test it online here: https://www.tehplayground.com/Z4uB5KIPPo4UQ4h1

Many thanks in advance!

UPDATE:

It turns out that my data is more complex than I had imagined. Within the same data, some characters were preserved as “é”, “à”, “ç”, etc., while others were converted to HTML entities. So I can have something like this:

{
    "someData1":
    {
        "data1":
        [
            "ecclésiastique"
        ],
        "data2": "s&amp;eacute;culiers"
    },
    "someData2":
    [
        {
            "anotherData1": "ecclésiastique",
            "anotherData2": "&lt;p&gt;Le cartulaire de 1380-1381 copi&amp;eacute; au XVIIe si&amp;egrave;cle et aujourd&amp;rsquo;hui perdu&lt;strong&gt;*&lt;/strong&gt;.",
            "anotherData3":
            {
                "text1": "texte here",
                "text2": "texte here"
            }
        },
        {
            "anotherData1": "ecclésiastique",
            "anotherData2": "&lt;p&gt;Le cartulaire de 1380-1381 copi&amp;eacute; au XVIIe si&amp;egrave;cle et aujourd&amp;rsquo;hui perdu&lt;strong&gt;*&lt;/strong&gt;.",
            "anotherData3":
            {
                "text1": "texte here",
                "text2": "texte here"
            }
        }
    ]
}

So I suppose I have to 1) convert all data to HTML entities, then 2) convert all HTML entities back to characters…

Here is my current code:

# Get data

$jsonData = '{
    "someData1":
    {
        "data1":
        [
            "ecclésiastique"
        ],
        "data2": "s&amp;eacute;culiers"
    },
    "someData2":
    [
        {
            "anotherData1": "ecclésiastique",
            "anotherData2": "&lt;p&gt;Le cartulaire de 1380-1381 copi&amp;eacute; au XVIIe si&amp;egrave;cle et aujourd&amp;rsquo;hui perdu&lt;strong&gt;*&lt;/strong&gt;.",
            "anotherData3":
            {
                "text1": "texte here",
                "text2": "texte here"
            }
        },
        {
            "anotherData1": "ecclésiastique",
            "anotherData2": "&lt;p&gt;Le cartulaire de 1380-1381 copi&amp;eacute; au XVIIe si&amp;egrave;cle et aujourd&amp;rsquo;hui perdu&lt;strong&gt;*&lt;/strong&gt;.",
            "anotherData3":
            {
                "text1": "texte here",
                "text2": "texte here"
            }
        }
    ]
}';

$data = json_decode($jsonData, true);

# Convert character encoding

$data = mb_convert_encoding($data, "UTF-8", "HTML-ENTITIES");

# Convert HTML entities to their corresponding characters

function html_decode(&$item){
    $item = html_entity_decode($item);
}

array_walk_recursive($data, 'html_decode');

var_dump ($data);

So I only succeeded in reversing the encoding: the values that were HTML entities became special characters, and those that were special characters became HTML entities.

But I have no idea how to end up with special characters only.

Online test: https://www.tehplayground.com/bVo3Jr5O7L9p4MXX

Here is the solution. I needed to

  1. convert &amp; to & so that every value uses the same level of encoding;
  2. convert all HTML entities to their corresponding characters.
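The two steps above amount to decoding twice. A variant of the same idea, not the code used in this post, is to decode each string until it stops changing, so that plain, single-encoded and double-encoded values all end up in the same form. A minimal sketch, assuming UTF-8 data; the helper name decode_entities_fully and the pass limit are hypothetical:

```php
<?php
/**
 * Hypothetical helper: decode a string repeatedly until it stops changing,
 * so single- and double-encoded values end up in the same form.
 */
function decode_entities_fully(string $text, int $maxPasses = 5): string
{
    for ($i = 0; $i < $maxPasses; $i++) {
        $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        if ($decoded === $text) {
            break; // nothing left to decode
        }
        $text = $decoded;
    }
    return $text;
}

// Mixed sample: one plain value, one double-encoded, one nested with tags.
$data = [
    'plain'  => 'ecclésiastique',
    'double' => 's&amp;eacute;culiers',
    'nested' => ['html' => '&lt;em&gt;l&amp;rsquo;abbaye&lt;/em&gt;'],
];

array_walk_recursive($data, function (&$item) {
    if (is_string($item)) {
        $item = decode_entities_fully($item);
    }
});

var_dump($data);
```

One caveat with this variant: a value that is meant to display a literal entity (for example text explaining what &amp;lt; means) would be decoded too far, so it only fits data where every entity is an encoding accident.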

Here is the final code. Many thanks to everyone for the comments and suggestions.

# Get data

$jsonData = '{
    "someData1":
    {
        "data1":
        [
            "ecclésiastique"
        ],
        "data2": "s&amp;eacute;culiers"
    },
    "someData2":
    [
        {
            "anotherData1": "ecclésiastique",
            "anotherData2": "&lt;p&gt;Le cartulaire de 1380-1381 copi&amp;eacute; au XVIIe si&amp;egrave;cle et aujourd&amp;rsquo;hui perdu&lt;strong&gt;*&lt;/strong&gt;.",
            "anotherData3":
            {
                "text1": "texte here",
                "text2": "texte here"
            }
        },
        {
            "anotherData1": "ecclésiastique",
            "anotherData2": "&lt;p&gt;Le cartulaire de 1380-1381 copi&amp;eacute; au XVIIe si&amp;egrave;cle et aujourd&amp;rsquo;hui perdu&lt;strong&gt;*&lt;/strong&gt;.",
            "anotherData3":
            {
                "text1": "texte here",
                "text2": "texte here"
            }
        }
    ]
}';

$data = json_decode($jsonData, true);

# Replace &amp; by &

array_walk_recursive($data, function(&$item, $key) {
    if(is_string($item)) {
        $item = str_replace("&amp;", "&", $item);
    }
});


# Convert HTML entities to their corresponding characters

array_walk_recursive($data, function(&$item, $key) {
    if(is_string($item)) {
        $item = html_entity_decode($item);
    }
});

var_dump($data);

Online test: https://www.tehplayground.com/ms1KxR2tywOxIS9J
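The final code stops at var_dump. To get back JSON like the one shown under "Desired result", the decoded array can be re-encoded with the same json_encode flags as in the first attempt. A minimal sketch, assuming PHP 7.3 or later (for JSON_THROW_ON_ERROR) and a shortened sample document; the function name decode_json_entities is hypothetical:

```php
<?php
// Hypothetical wrapper: decode JSON, undo the double encoding, re-encode.
function decode_json_entities(string $json): string
{
    $data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

    array_walk_recursive($data, function (&$item) {
        if (is_string($item)) {
            // Same two steps as the final code: normalize &amp;, then decode.
            $item = html_entity_decode(
                str_replace('&amp;', '&', $item),
                ENT_QUOTES | ENT_HTML5,
                'UTF-8'
            );
        }
    });

    // Keep accented characters and slashes readable in the output.
    return json_encode(
        $data,
        JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES | JSON_THROW_ON_ERROR
    );
}

$json = '{"data2": "s&amp;eacute;culiers", "list": ["ecclésiastique", "&lt;p&gt;si&amp;egrave;cle&lt;/p&gt;"]}';

echo decode_json_entities($json), PHP_EOL;
// {"data2":"séculiers","list":["ecclésiastique","<p>siècle</p>"]}
```

Without JSON_UNESCAPED_UNICODE, json_encode would write the accented characters as \uXXXX escapes, which is still valid JSON but harder to read.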
