Why can't I parse my scraped HTML into XML?

I am trying to parse some scraped HTML into valid XML, using this function.

My test code (with the htmlParse function copied and pasted from Ben Nadel's blog):

<cfscript>
    // I take an HTML string and parse it into an XML(XHTML)
    // document. This is returned as a standard ColdFusion XML
    // document.
    function htmlParse( htmlContent, disableNamespaces = true ){

        // Create an instance of the Xalan SAX2DOM class as the
        // recipient of the TagSoup SAX (Simple API for XML) compliant
        // events. TagSoup will parse the HTML and announce events as
        // it encounters various HTML nodes. The SAX2DOM instance will
        // listen for such events and construct a DOM tree in response.
        var saxDomBuilder = createObject( "java", "com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM" ).init();

        // Create our TagSoup parser.
        var tagSoupParser = createObject( "java", "org.ccil.cowan.tagsoup.Parser" ).init();

        // Check to see if namespaces are going to be disabled in the
        // parser. If so, then they will not be added to elements.
        if (disableNamespaces){

            // Turn off namespaces - they are lame and nobody likes
            // to perform xmlSearch() methods with them in place.
            tagSoupParser.setFeature(
                tagSoupParser.namespacesFeature,
                javaCast( "boolean", false )
            );

        }

        // Set our DOM builder to be the listener for SAX-based
        // parsing events on our HTML.
        tagSoupParser.setContentHandler( saxDomBuilder );

        // Create our content input. The InputSource encapsulates the
        // means by which the content is read.
        var inputSource = createObject( "java", "org.xml.sax.InputSource" ).init(
            createObject( "java", "java.io.StringReader" ).init( htmlContent )
        );

        // Parse the HTML. This will trigger events which the SAX2DOM
        // builder will translate into a DOM tree.
        tagSoupParser.parse( inputSource );

        // Now that the HTML has been parsed, we have to get a
        // representation that is similar to the XML document that
        // ColdFusion users are used to having. Let's search for the
        // ROOT document and return it.
        return(
            xmlSearch( saxDomBuilder.getDom(), "/node()" )[ 1 ]
        );

    }
</cfscript>
<cfset html='<tr > <td align="center"> <span id="id1" >Compliance Review</span> </td><td class="center"> <span id="id2" >395.8(i)</span> </td><td align="left"> <span id="id3" >Failing to submit a record of duty status within 13 days </span> </td><td class="center" > <span id="id4">4/17/2014</span> </td> </tr>' />
<cfset parsedData = htmlParse(html) />

(The HTML arrives in this format from a different function, but I have hardcoded the string for now to trace the problem.)

I get the following error:

NOT_FOUND_ERR: An attempt is made to reference a node in a context where it does not exist. 
The error occurred in myfilePath/myfileName.cfm: line 42

40 :        // Parse the HTML. This will trigger events which the SAX2DOM
41 :        // builder will translate into a DOM tree.
42 :        tagSoupParser.parse( inputSource );

What is going wrong? How can I correct it?

I haven't used TagSoup, but I have used jTidy for years with great results. It takes user-provided HTML from all kinds of sources (including MS Word) and cleans it up into XHTML.

You can try jTidy on the same document by dropping the jTidy jar onto your classpath or by using JavaLoader to load it. Since you're on CF10, you can use this method to include the JAR.

Then, here's how to call jTidy in cfscript:

jTidy = createObject("java", "org.w3c.tidy.Tidy");

jTidy.setQuiet(false);
jTidy.setIndentContent(true);
jTidy.setSmartIndent(true);
jTidy.setIndentAttributes(true);
jTidy.setWraplen(1024);
jTidy.setXHTML(true);
jTidy.setNumEntities(true);
jTidy.setConvertWindowsChars(true);             
jTidy.setFixBackslash(true);        // changes \ in urls to /
jTidy.setLogicalEmphasis(true);     // uses strong/em instead of b/i
jTidy.setDropEmptyParas(true);

// create the in and out streams for jTidy
// (parseData is the raw HTML string to clean up)
readBuffer = CreateObject("java","java.lang.String").init(parseData).getBytes();
inP = createobject("java","java.io.ByteArrayInputStream").init(readBuffer);
outx = createObject("java", "java.io.ByteArrayOutputStream").init();

// do the parsing
jTidy.parse(inP,outx);
outstr = outx.toString();

This will return valid XHTML, which you can query with XPath. I wrapped the above in a makeValid() function and then ran it against your HTML:

<cfset html='<tr > <td align="center"> <span id="id1" >Compliance Review</span> </td><td class="center"> <span id="id2" >395.8(i)</span> </td><td align="left"> <span id="id3" >Failing to submit a record of duty status within 13 days </span> </td><td class="center" > <span id="id4">4/17/2014</span> </td> </tr>' />
<cfset out = makeValid(html) />
<cfdump var="#xmlParse(out)#" />

And here was the output:

[Image: cfdump output from xmlParse()]
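
The answer mentions a makeValid() wrapper but does not show its body. A minimal sketch of what such a wrapper might look like follows. Only the function name comes from the answer; the argument name, the local variable names, and the reduced set of jTidy options are assumptions for illustration, and the jTidy jar is assumed to be loadable already.

```cfml
<cfscript>
    // Hypothetical reconstruction of makeValid(): the name is from the
    // answer, the body is an assumed sketch, not the author's code.
    function makeValid( required string htmlContent ){

        // Assumes the jTidy jar is already on the ColdFusion classpath.
        var jTidy = createObject( "java", "org.w3c.tidy.Tidy" ).init();

        // Ask for XHTML output with numeric entities, and keep jTidy
        // from printing its report and warnings.
        jTidy.setXHTML( javaCast( "boolean", true ) );
        jTidy.setNumEntities( javaCast( "boolean", true ) );
        jTidy.setQuiet( javaCast( "boolean", true ) );
        jTidy.setShowWarnings( javaCast( "boolean", false ) );

        // jTidy reads from an InputStream and writes to an OutputStream,
        // so wrap the HTML string in byte streams. As in the answer, the
        // platform default character encoding is used here.
        var inStream = createObject( "java", "java.io.ByteArrayInputStream" ).init(
            javaCast( "string", htmlContent ).getBytes()
        );
        var outStream = createObject( "java", "java.io.ByteArrayOutputStream" ).init();

        // Clean up the markup and return the result as a string.
        jTidy.parse( inStream, outStream );
        return outStream.toString();

    }
</cfscript>
```

With a function along these lines in scope, the call shown above, makeValid(html), would return a string that can be passed to xmlParse(). The indentation and formatting options from the answer can be added back in the same place as the other setters.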
