
Parse .htm file/URL into .xml file

I am trying to use JTidy to convert a .htm web page into an .xml file, and I then need to extract some data/anchor elements from that .xml file. However, the conversion step always produces an error file containing warnings such as "Warning: unknown attribute" and "Warning: <title> isn't allowed in <body> elements".

import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.net.URL;

import org.w3c.tidy.Tidy;

public class Test {
    private final String url;
    private final String outFileName;
    private final String errOutFileName;

    public Test(String url, String outFileName, String errOutFileName) {
        this.url = url;
        this.outFileName = outFileName;
        this.errOutFileName = errOutFileName;
    }

    public void convert() {
        Tidy tidy = new Tidy();

        // Emit well-formed XML instead of HTML
        tidy.setXmlOut(true);

        // try-with-resources closes all three streams even if parsing fails
        try (PrintWriter errOut = new PrintWriter(new FileWriter(errOutFileName), true);
             InputStream in = new BufferedInputStream(new URL(url).openStream());
             OutputStream out = new FileOutputStream(outFileName)) {

            // Send JTidy's warnings and errors to the error file
            tidy.setErrout(errOut);

            // Read the page and write the converted document
            tidy.parse(in, out);

        } catch (IOException e) {
            System.out.println(this.toString() + e.toString());
        }
    }

    public static void main(String[] args) {
        // Test(URL, converted output file, error output file)
        Test t = new Test("here is the http.....", "e:/...../correctOutput.xml", "e:/...../errorOutput.xml");
        t.convert();
    }
}

Thanks a lot for your help. Is there a better way to do this? Some detailed code would be much appreciated.

You can use XSLT to transform it: http://www.w3schools.com/xml/xml_xsl.asp
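For what it's worth, here is a rough sketch of what the XSLT route could look like for the anchor-extraction part, using only the JDK's built-in javax.xml.transform API. The class name AnchorExtractor, the method extractLinks, the <links>/<link> output element names and the sample markup are all made up for illustration; none of them come from the question. It assumes the input is already well-formed XML (for example, the file JTidy wrote), and it matches on local-name() so it should not matter whether the document declares the XHTML namespace.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class AnchorExtractor {

    // Hypothetical stylesheet: copies every <a> that has an href into a flat <links> list.
    private static final String STYLESHEET = String.join("\n",
        "<xsl:stylesheet version=\"1.0\"",
        "    xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">",
        "  <xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>",
        "  <xsl:template match=\"/\">",
        "    <links>",
        "      <xsl:for-each select=\"//*[local-name()='a'][@href]\">",
        "        <link href=\"{@href}\">",
        "          <xsl:value-of select=\"normalize-space(.)\"/>",
        "        </link>",
        "      </xsl:for-each>",
        "    </links>",
        "  </xsl:template>",
        "</xsl:stylesheet>");

    // Applies the stylesheet to a well-formed XML string and returns the result as a string.
    public static String extractLinks(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Made-up sample standing in for the XML that JTidy produced
        String sample = "<html><head><title>Demo</title></head><body>"
            + "<p><a href=\"http://example.com/a\">First</a></p>"
            + "<p><a href=\"http://example.com/b\"> Second  link </a></p>"
            + "<a name=\"top\">anchor without href</a>"
            + "</body></html>";
        System.out.println(extractLinks(sample));
    }
}
```

To run it against the converted file instead of a string, the input source could be swapped for new StreamSource(new File(outFileName)). One caveat: if the file starts with a DOCTYPE that points at an external DTD, the XML parser may try to download it, so that may need to be handled separately.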

