如何使用PHP将HTML转换为JSON？

Question

I can convert JSON to HTML using JsontoHtml library. 我可以使用JsontoHtml库将JSON转换为HTML。 Now,I need to convert present HTML to JSON as shown in this site. 现在，我需要将当前的HTML转换为JSON，如本站所示。 When looked into the code I found the following script: 查看代码时，我发现了以下脚本：

<script>
$(function(){

    //HTML to JSON
    $('#btn-render-json').click(function() {

        //Set html output
        $('#html-output').html( $('#html-input').val() );

        //Process to JSON and format it for consumption
        $('#html-json').html( FormatJSON(toTransform($('#html-output').children())) );
    });

});

//Convert obj or array to transform
function toTransform(obj) {

    var json;

    if( obj.length > 1 )
    {
        json = [];

        for(var i = 0; i < obj.length; i++)
            json[json.length++] = ObjToTransform(obj[i]);
    } else
        json = ObjToTransform(obj);

    return(json);
}

//Convert obj to transform
function ObjToTransform(obj)
{
    //Get the DOM element
    var el = $(obj).get(0);

    //Add the tag element
    var json = {'tag':el.nodeName.toLowerCase()};

    for (var attr, i=0, attrs=el.attributes, l=attrs.length; i<l; i++){
        attr = attrs[i];
        json[attr.nodeName] = attr.value;
    }

    var children = $(obj).children();

    if( children.length > 0 ) json['children'] = [];
    else json['html'] = $(obj).text();

    //Add the children
    for(var c = 0; c < children.length; c++)
        json['children'][json['children'].length++] = toTransform(children[c]);

    return(json);
}

//Format JSON (with indents)
function FormatJSON(oData, sIndent) {
    if (arguments.length < 2) {
        var sIndent = "";
    }
    var sIndentStyle = "  ";
    var sDataType = RealTypeOf(oData);

    // open object
    if (sDataType == "array") {
        if (oData.length == 0) {
            return "[]";
        }
        var sHTML = "[";
    } else {
        var iCount = 0;
        $.each(oData, function() {
            iCount++;
            return;
        });
        if (iCount == 0) { // object is empty
            return "{}";
        }
        var sHTML = "{";
    }

    // loop through items
    var iCount = 0;
    $.each(oData, function(sKey, vValue) {
        if (iCount > 0) {
            sHTML += ",";
        }
        if (sDataType == "array") {
            sHTML += ("\n" + sIndent + sIndentStyle);
        } else {
            sHTML += ("\"" + sKey + "\"" + ":");
        }

        // display relevant data type
        switch (RealTypeOf(vValue)) {
            case "array":
            case "object":
                sHTML += FormatJSON(vValue, (sIndent + sIndentStyle));
                break;
            case "boolean":
            case "number":
                sHTML += vValue.toString();
                break;
            case "null":
                sHTML += "null";
                break;
            case "string":
                sHTML += ("\"" + vValue + "\"");
                break;
            default:
                sHTML += ("TYPEOF: " + typeof(vValue));
        }

        // loop
        iCount++;
    });

    // close object
    if (sDataType == "array") {
        sHTML += ("\n" + sIndent + "]");
    } else {
        sHTML += ("}");
    }

    // return
    return sHTML;
}

//Get the type of the obj (can replace by jquery type)
function RealTypeOf(v) {
  if (typeof(v) == "object") {
    if (v === null) return "null";
    if (v.constructor == (new Array).constructor) return "array";
    if (v.constructor == (new Date).constructor) return "date";
    if (v.constructor == (new RegExp).constructor) return "regex";
    return "object";
  }
  return typeof(v);
}
</script>

在此输入图像描述

Now, I am in need of using the following function in PHP. 现在，我需要在PHP中使用以下函数。 I can get the HTML data. 我可以获取HTML数据。 All what I needed now is to convert the JavaScript function to PHP function. 我现在需要的只是将JavaScript函数转换为PHP函数。 Is this possible? 这可能吗？ My major doubts are as follows: 我的主要疑虑如下：

The primary input for the Javascript function toTransform() is an object. Javascript函数toTransform()的主要输入是一个对象。 Is it possible to convert HTML to object via PHP? 是否可以通过PHP将HTML转换为对象？
Are all the functions present in this particular JavaScript available in PHP? PHP中是否提供了此特定JavaScript中的所有函数？

Please suggest me the idea. 请建议我这个想法。

When I tried to convert script tag to json as per the answer given, I get errors. 当我尝试根据给出的答案将脚本标记转换为json时，我会收到错误。 When I tried it in json2html site, it showed like this: 当我在json2html网站上尝试它时，它显示如下： 在此输入图像描述 .. How to achieve the same solution? ..如何实现相同的解决方案？

Answer 1

If you are able to obtain a DOMDocument object representing your HTML, then you just need to traverse it recursively and construct the data structure that you want. 如果您能够获得表示HTML的DOMDocument对象，那么您只需要递归遍历它并构建所需的数据结构。

Converting your HTML document into a DOMDocument should be as simple as this: 将HTML文档转换为DOMDocument应该像这样简单：

function html_to_obj($html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    return element_to_obj($dom->documentElement);
}

Then, a simple traversal of $dom->documentElement which gives the kind of structure you described could look like this: 然后， $dom->documentElement的简单遍历给出了你描述的结构类型：

function element_to_obj($element) {
    $obj = array( "tag" => $element->tagName );
    foreach ($element->attributes as $attribute) {
        $obj[$attribute->name] = $attribute->value;
    }
    foreach ($element->childNodes as $subElement) {
        if ($subElement->nodeType == XML_TEXT_NODE) {
            $obj["html"] = $subElement->wholeText;
        }
        else {
            $obj["children"][] = element_to_obj($subElement);
        }
    }
    return $obj;
}

Test case 测试用例

$html = <<<EOF
<!DOCTYPE html>
<html lang="en">
    <head>
        <title> This is a test </title>
    </head>
    <body>
        <h1> Is this working? </h1>  
        <ul>
            <li> Yes </li>
            <li> No </li>
        </ul>
    </body>
</html>

EOF;

header("Content-Type: text/plain");
echo json_encode(html_to_obj($html), JSON_PRETTY_PRINT);

Output 产量

{
    "tag": "html",
    "lang": "en",
    "children": [
        {
            "tag": "head",
            "children": [
                {
                    "tag": "title",
                    "html": " This is a test "
                }
            ]
        },
        {
            "tag": "body",
            "html": "  \n        ",
            "children": [
                {
                    "tag": "h1",
                    "html": " Is this working? "
                },
                {
                    "tag": "ul",
                    "children": [
                        {
                            "tag": "li",
                            "html": " Yes "
                        },
                        {
                            "tag": "li",
                            "html": " No "
                        }
                    ],
                    "html": "\n        "
                }
            ]
        }
    ]
}

Answer to updated question 回答更新的问题

The solution proposed above does not work with the <script> element, because it is parsed not as a DOMText , but as a DOMCharacterData object. 上面提出的解决方案不适用于<script>元素，因为它不是作为DOMText解析，而是作为DOMCharacterData对象解析。 This is because the DOM extension in PHP is based on libxml2 , which parses your HTML as HTML 4.0, and in HTML 4.0 the content of <script> is of type CDATA and not #PCDATA . 这是因为PHP中的DOM扩展基于libxml2 ，它将HTML解析为HTML 4.0，而在HTML 4.0中， <script>的内容是CDATA类型，而不是#PCDATA 。

You have two solutions for this problem. 您有两个解决此问题的方法。

The simple but not very robust solution would be to add the LIBXML_NOCDATA flag to DOMDocument::loadHTML . 简单但不是非常健壮的解决方案是将LIBXML_NOCDATA标志添加到DOMDocument::loadHTML 。 (I am not actually 100% sure whether this works for the HTML parser.) （我实际上并不是100％确定这是否适用于HTML解析器。）
The more difficult but, in my opinion, better solution, is to add an additonal test when you are testing $subElement->nodeType before the recursion. 更困难的是，在我看来，更好的解决方案是在递归之前测试$subElement->nodeType时添加一个额外的测试。 The recursive function would become: 递归函数将变为：

function element_to_obj($element) {
    echo $element->tagName, "\n";
    $obj = array( "tag" => $element->tagName );
    foreach ($element->attributes as $attribute) {
        $obj[$attribute->name] = $attribute->value;
    }
    foreach ($element->childNodes as $subElement) {
        if ($subElement->nodeType == XML_TEXT_NODE) {
            $obj["html"] = $subElement->wholeText;
        }
        elseif ($subElement->nodeType == XML_CDATA_SECTION_NODE) {
            $obj["html"] = $subElement->data;
        }
        else {
            $obj["children"][] = element_to_obj($subElement);
        }
    }
    return $obj;
}

If you hit on another bug of this type, the first thing you should do is check the type of node $subElement is, because there exists many other possibilities my short example function did not deal with. 如果你遇到这种类型的另一个bug，你应该做的第一件事是检查节点$subElement的类型，因为我的短示例函数没有处理许多其他可能性。

Additionally, you will notice that libxml2 has to fix mistakes in your HTML in order to be able to build a DOM for it. 另外，您会注意到libxml2必须修复HTML中的错误才能为其构建DOM。 This is why an <html> and a <head> elements will appear even if you don't specify them. 这就是为什么即使您没有指定它们也会出现<html>和<head>元素。 You can avoid this by using the LIBXML_HTML_NOIMPLIED flag. 您可以使用LIBXML_HTML_NOIMPLIED标志来避免这种情况。

Test case with script 用脚本测试用例

$html = <<<EOF
        <script type="text/javascript">
            alert('hi');
        </script>
EOF;

header("Content-Type: text/plain");
echo json_encode(html_to_obj($html), JSON_PRETTY_PRINT);

Output 产量

{
    "tag": "html",
    "children": [
        {
            "tag": "head",
            "children": [
                {
                    "tag": "script",
                    "type": "text\/javascript",
                    "html": "\n            alert('hi');\n        "
                }
            ]
        }
    ]
}

Answer 2

I assume that your html string is stored in $html variable. 我假设你的html字符串存储在$html变量中。 So you should do: 所以你应该这样做：

$dom = new DOMDocument();
$dom->loadHTML($html);

foreach($dom->getElementsByTagName('*') as $el){
    $result[] = ["type" => $el->tagName, "value" => $el->nodeValue];
}

$json = json_encode($result, JSON_UNESCAPED_UNICODE);

Note : This algorithm doesn't support parent-child tags and fetch all tags as parent elements and parses all of them in a sorted queue. 注意：此算法不支持父子标记，并将所有标记作为父元素获取，并在排序队列中解析所有标记。 Of course, you can implement this feature by studying the DOMDocument classes features. 当然，您可以通过研究DOMDocument类功能来实现此功能。

如何使用PHP将HTML转换为JSON？

问题描述

2 个解决方案

解决方案1
25 已采纳 2014-05-02 11:49:45

解决方案2
1 2018-05-30 17:17:35

如何使用PHP将HTML转换为JSON？

问题描述

2 个解决方案

解决方案1 25 已采纳 2014-05-02 11:49:45

解决方案2 1 2018-05-30 17:17:35

解决方案1
25 已采纳 2014-05-02 11:49:45

解决方案2
1 2018-05-30 17:17:35