Parse .htm file/URL into .xml file
I am trying to transform a .htm webpage into an .xml file using JTidy, and I will then need to extract some data/anchor elements from that .xml file.
However, the transformation step always produces an error file containing warnings such as "Warning: unknown attribute" and "Warning: <title> isn't allowed in <body> elements".
import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.net.URL;

import org.w3c.tidy.Tidy;

public class Test {
    private final String url;
    private final String outFileName;
    private final String errOutFileName;

    public Test(String url, String outFileName, String errOutFileName) {
        this.url = url;
        this.outFileName = outFileName;
        this.errOutFileName = errOutFileName;
    }

    public void convert() {
        Tidy tidy = new Tidy();
        tidy.setXmlOut(true);
        // try-with-resources closes all three streams even if parsing fails
        try (PrintWriter errOut = new PrintWriter(new FileWriter(errOutFileName), true);
             InputStream in = new BufferedInputStream(new URL(url).openStream());
             OutputStream out = new FileOutputStream(outFileName)) {
            // Set file for error messages
            tidy.setErrout(errOut);
            // Convert the page
            tidy.parse(in, out);
        } catch (IOException e) {
            System.out.println(this.toString() + e.toString());
        }
    }

    public static void main(String[] args) {
        // Test(url address, correctOutput file directory, errorOutput file)
        Test t = new Test("here is the http.....", "e:/...../correctOutput.xml", "e:/...../errorOutput.xml");
        t.convert();
    }
}
Thanks so much for your help. Is there a better way to accomplish this? I would really appreciate some detailed code.
You can use XSLT to transform it: http://www.w3schools.com/xml/xml_xsl.asp
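As a rough illustration of the XSLT suggestion, here is a minimal sketch that uses only the JDK's built-in javax.xml.transform API to pull the href of every anchor out of a well-formed XML/XHTML document. The class name AnchorExtractor, the method extractLinks, and the inline stylesheet are assumptions made for this example, not part of the original question. The sketch assumes the input is already well-formed XML, such as the file JTidy writes. It matches on local-name() so that it works whether or not the elements are in the XHTML namespace.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: names are illustrative, not from the original post.
public class AnchorExtractor {

    // Stylesheet that emits one href per line for every <a> element that has one.
    private static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:for-each select=\"//*[local-name()='a'][@href]\">"
        + "<xsl:value-of select='@href'/><xsl:text>&#10;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to a well-formed XML string and returns the text result.
    public static String extractLinks(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Small stand-in for the tidied page; a real run would read the generated .xml file.
        String sample = "<html><body><a href='http://example.com/a'>A</a>"
                      + "<p><a href='/b'>B</a></p></body></html>";
        System.out.print(extractLinks(sample));
    }
}
```

To run it against the file produced by the conversion step, one option is to pass new StreamSource(new java.io.File(outFileName)) as the input instead of a string. If that file keeps a DOCTYPE declaration, the XML parser may try to fetch the referenced DTD, so it is worth checking this on a real page.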