简体   繁体   English

如何使用PHP将HTML转换为JSON?

[英]How to convert HTML to JSON using PHP?

I can convert JSON to HTML using JsontoHtml library. 我可以使用JsontoHtml库将JSON转换为HTML。 Now,I need to convert present HTML to JSON as shown in this site. 现在,我需要将当前的HTML转换为JSON,如本站所示。 When looked into the code I found the following script: 查看代码时,我发现了以下脚本:

<script>
$(function(){

    //HTML to JSON
    $('#btn-render-json').click(function() {

        //Set html output
        $('#html-output').html( $('#html-input').val() );

        //Process to JSON and format it for consumption
        $('#html-json').html( FormatJSON(toTransform($('#html-output').children())) );
    });

});

//Convert obj or array to transform
function toTransform(obj) {

    var json;

    if( obj.length > 1 )
    {
        json = [];

        for(var i = 0; i < obj.length; i++)
            json[json.length++] = ObjToTransform(obj[i]);
    } else
        json = ObjToTransform(obj);

    return(json);
}

//Convert obj to transform
function ObjToTransform(obj)
{
    //Get the DOM element
    var el = $(obj).get(0);

    //Add the tag element
    var json = {'tag':el.nodeName.toLowerCase()};

    for (var attr, i=0, attrs=el.attributes, l=attrs.length; i<l; i++){
        attr = attrs[i];
        json[attr.nodeName] = attr.value;
    }

    var children = $(obj).children();

    if( children.length > 0 ) json['children'] = [];
    else json['html'] = $(obj).text();

    //Add the children
    for(var c = 0; c < children.length; c++)
        json['children'][json['children'].length++] = toTransform(children[c]);

    return(json);
}

//Format JSON (with indents)
function FormatJSON(oData, sIndent) {
    if (arguments.length < 2) {
        var sIndent = "";
    }
    var sIndentStyle = "  ";
    var sDataType = RealTypeOf(oData);

    // open object
    if (sDataType == "array") {
        if (oData.length == 0) {
            return "[]";
        }
        var sHTML = "[";
    } else {
        var iCount = 0;
        $.each(oData, function() {
            iCount++;
            return;
        });
        if (iCount == 0) { // object is empty
            return "{}";
        }
        var sHTML = "{";
    }

    // loop through items
    var iCount = 0;
    $.each(oData, function(sKey, vValue) {
        if (iCount > 0) {
            sHTML += ",";
        }
        if (sDataType == "array") {
            sHTML += ("\n" + sIndent + sIndentStyle);
        } else {
            sHTML += ("\"" + sKey + "\"" + ":");
        }

        // display relevant data type
        switch (RealTypeOf(vValue)) {
            case "array":
            case "object":
                sHTML += FormatJSON(vValue, (sIndent + sIndentStyle));
                break;
            case "boolean":
            case "number":
                sHTML += vValue.toString();
                break;
            case "null":
                sHTML += "null";
                break;
            case "string":
                sHTML += ("\"" + vValue + "\"");
                break;
            default:
                sHTML += ("TYPEOF: " + typeof(vValue));
        }

        // loop
        iCount++;
    });

    // close object
    if (sDataType == "array") {
        sHTML += ("\n" + sIndent + "]");
    } else {
        sHTML += ("}");
    }

    // return
    return sHTML;
}

//Get the type of the obj (can replace by jquery type)
function RealTypeOf(v) {
  if (typeof(v) == "object") {
    if (v === null) return "null";
    if (v.constructor == (new Array).constructor) return "array";
    if (v.constructor == (new Date).constructor) return "date";
    if (v.constructor == (new RegExp).constructor) return "regex";
    return "object";
  }
  return typeof(v);
}
</script>

在此输入图像描述

Now, I am in need of using the following function in PHP. 现在,我需要在PHP中使用以下函数。 I can get the HTML data. 我可以获取HTML数据。 All what I needed now is to convert the JavaScript function to PHP function. 我现在需要的只是将JavaScript函数转换为PHP函数。 Is this possible? 这可能吗? My major doubts are as follows: 我的主要疑虑如下:

  • The primary input for the Javascript function toTransform() is an object. Javascript函数toTransform()的主要输入是一个对象。 Is it possible to convert HTML to object via PHP? 是否可以通过PHP将HTML转换为对象?

  • Are all the functions present in this particular JavaScript available in PHP? PHP中是否提供了此特定JavaScript中的所有函数?

Please suggest me the idea. 请建议我这个想法。

When I tried to convert script tag to json as per the answer given, I get errors. 当我尝试根据给出的答案将脚本标记转换为json时,我会收到错误。 When I tried it in json2html site, it showed like this: 当我在json2html网站上尝试它时,它显示如下: 在此输入图像描述 .. How to achieve the same solution? ..如何实现相同的解决方案?

If you are able to obtain a DOMDocument object representing your HTML, then you just need to traverse it recursively and construct the data structure that you want. 如果您能够获得表示HTML的DOMDocument对象,那么您只需要递归遍历它并构建所需的数据结构。

Converting your HTML document into a DOMDocument should be as simple as this: 将HTML文档转换为DOMDocument应该像这样简单:

function html_to_obj($html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    return element_to_obj($dom->documentElement);
}

Then, a simple traversal of $dom->documentElement which gives the kind of structure you described could look like this: 然后, $dom->documentElement的简单遍历给出了你描述的结构类型:

function element_to_obj($element) {
    $obj = array( "tag" => $element->tagName );
    foreach ($element->attributes as $attribute) {
        $obj[$attribute->name] = $attribute->value;
    }
    foreach ($element->childNodes as $subElement) {
        if ($subElement->nodeType == XML_TEXT_NODE) {
            $obj["html"] = $subElement->wholeText;
        }
        else {
            $obj["children"][] = element_to_obj($subElement);
        }
    }
    return $obj;
}

Test case 测试用例

$html = <<<EOF
<!DOCTYPE html>
<html lang="en">
    <head>
        <title> This is a test </title>
    </head>
    <body>
        <h1> Is this working? </h1>  
        <ul>
            <li> Yes </li>
            <li> No </li>
        </ul>
    </body>
</html>

EOF;

header("Content-Type: text/plain");
echo json_encode(html_to_obj($html), JSON_PRETTY_PRINT);

Output 产量

{
    "tag": "html",
    "lang": "en",
    "children": [
        {
            "tag": "head",
            "children": [
                {
                    "tag": "title",
                    "html": " This is a test "
                }
            ]
        },
        {
            "tag": "body",
            "html": "  \n        ",
            "children": [
                {
                    "tag": "h1",
                    "html": " Is this working? "
                },
                {
                    "tag": "ul",
                    "children": [
                        {
                            "tag": "li",
                            "html": " Yes "
                        },
                        {
                            "tag": "li",
                            "html": " No "
                        }
                    ],
                    "html": "\n        "
                }
            ]
        }
    ]
}

Answer to updated question 回答更新的问题

The solution proposed above does not work with the <script> element, because it is parsed not as a DOMText , but as a DOMCharacterData object. 上面提出的解决方案不适用于<script>元素,因为它不是作为DOMText解析,而是作为DOMCharacterData对象解析。 This is because the DOM extension in PHP is based on libxml2 , which parses your HTML as HTML 4.0, and in HTML 4.0 the content of <script> is of type CDATA and not #PCDATA . 这是因为PHP中的DOM扩展基于libxml2它将HTML解析为HTML 4.0,而在HTML 4.0中, <script>的内容是CDATA类型,而不是#PCDATA

You have two solutions for this problem. 您有两个解决此问题的方法。

  1. The simple but not very robust solution would be to add the LIBXML_NOCDATA flag to DOMDocument::loadHTML . 简单但不是非常健壮的解决方案是将LIBXML_NOCDATA标志添加到DOMDocument::loadHTML (I am not actually 100% sure whether this works for the HTML parser.) (我实际上并不是100%确定这是否适用于HTML解析器。)

  2. The more difficult but, in my opinion, better solution, is to add an additonal test when you are testing $subElement->nodeType before the recursion. 更困难的是,在我看来,更好的解决方案是在递归之前测试$subElement->nodeType时添加一个额外的测试。 The recursive function would become: 递归函数将变为:

function element_to_obj($element) {
    echo $element->tagName, "\n";
    $obj = array( "tag" => $element->tagName );
    foreach ($element->attributes as $attribute) {
        $obj[$attribute->name] = $attribute->value;
    }
    foreach ($element->childNodes as $subElement) {
        if ($subElement->nodeType == XML_TEXT_NODE) {
            $obj["html"] = $subElement->wholeText;
        }
        elseif ($subElement->nodeType == XML_CDATA_SECTION_NODE) {
            $obj["html"] = $subElement->data;
        }
        else {
            $obj["children"][] = element_to_obj($subElement);
        }
    }
    return $obj;
}

If you hit on another bug of this type, the first thing you should do is check the type of node $subElement is, because there exists many other possibilities my short example function did not deal with. 如果你遇到这种类型的另一个bug,你应该做的第一件事是检查节点$subElement的类型,因为我的短示例函数没有处理许多其他可能性

Additionally, you will notice that libxml2 has to fix mistakes in your HTML in order to be able to build a DOM for it. 另外,您会注意到libxml2必须修复HTML中的错误才能为其构建DOM。 This is why an <html> and a <head> elements will appear even if you don't specify them. 这就是为什么即使您没有指定它们也会出现<html><head>元素。 You can avoid this by using the LIBXML_HTML_NOIMPLIED flag. 您可以使用LIBXML_HTML_NOIMPLIED标志来避免这种情况。

Test case with script 用脚本测试用例

$html = <<<EOF
        <script type="text/javascript">
            alert('hi');
        </script>
EOF;

header("Content-Type: text/plain");
echo json_encode(html_to_obj($html), JSON_PRETTY_PRINT);

Output 产量

{
    "tag": "html",
    "children": [
        {
            "tag": "head",
            "children": [
                {
                    "tag": "script",
                    "type": "text\/javascript",
                    "html": "\n            alert('hi');\n        "
                }
            ]
        }
    ]
}

I assume that your html string is stored in $html variable. 我假设你的html字符串存储在$html变量中。 So you should do: 所以你应该这样做:

$dom = new DOMDocument();
$dom->loadHTML($html);

foreach($dom->getElementsByTagName('*') as $el){
    $result[] = ["type" => $el->tagName, "value" => $el->nodeValue];
}

$json = json_encode($result, JSON_UNESCAPED_UNICODE);

Note : This algorithm doesn't support parent-child tags and fetch all tags as parent elements and parses all of them in a sorted queue. 注意 :此算法不支持父子标记,并将所有标记作为父元素获取,并在排序队列中解析所有标记。 Of course, you can implement this feature by studying the DOMDocument classes features. 当然,您可以通过研究DOMDocument类功能来实现此功能。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM