Clean up HTML input using HtmlCleaner
I am writing a Java project that uses the HtmlCleaner library and saves the output as an XML file. This is the code I wrote:
URL urlSB = new URL("http://www.groupon.com/browse/chicago?z=skip");
URLConnection urlConnection = urlSB.openConnection();
urlConnection.addRequestProperty("User-Agent", "google.com");
urlConnection.connect();
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
props.setNamespacesAware(false);
TagNode tagNodeRoot = cleaner.clean(urlConnection.getInputStream());
// serialize to xml file
new PrettyXmlSerializer(props).writeToFile(
tagNodeRoot , "cleaned.xml", "utf-8"
);
The problem is that after I run the project, the cleaned.xml file is empty.
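The problem is that the page you are requesting is configured to redirect to HTTPS. Java's HttpURLConnection follows redirects automatically only when the protocol stays the same, so the HTTP to HTTPS redirect is not followed and the input stream you read is the empty body of the redirect response. If you change the URL to HTTPS, it works: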
URL urlSB = new URL("https://www.groupon.com/browse/chicago?z=skip");
URLConnection urlConnection = urlSB.openConnection();
urlConnection.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:5.0) Gecko/20100101 Firefox/25.0");
urlConnection.connect();
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
props.setNamespacesAware(false);
TagNode tagNodeRoot = cleaner.clean(urlConnection.getInputStream());
new PrettyXmlSerializer(props).writeToFile(tagNodeRoot, "cleaned.xml", "utf-8");
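If you cannot hard-code the HTTPS URL, one option is to follow the redirect yourself. The following is only a minimal sketch under assumptions: `RedirectCheck`, `resolve`, `switchesProtocol`, and `open` are hypothetical names that are not part of HtmlCleaner or the JDK, and the hop limit is an arbitrary choice. It disables automatic redirects, reads the `Location` header, and reopens the connection at the new address, which also covers a switch from HTTP to HTTPS.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical helper for illustration; not part of HtmlCleaner or the JDK. */
final class RedirectCheck {

    /** Resolves a Location header value (absolute or relative) against the requested URL. */
    static String resolve(String requestedUrl, String location) {
        return URI.create(requestedUrl).resolve(location).toString();
    }

    /** Returns true when following the redirect would change the scheme, e.g. http to https. */
    static boolean switchesProtocol(String requestedUrl, String location) {
        String from = URI.create(requestedUrl).getScheme();
        String to = URI.create(resolve(requestedUrl, location)).getScheme();
        return !from.equalsIgnoreCase(to);
    }

    /** Opens the URL and follows up to maxHops redirects by hand, including http to https. */
    static InputStream open(String url, String userAgent, int maxHops) throws IOException {
        String current = url;
        for (int hop = 0; hop <= maxHops; hop++) {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(current).toURL().openConnection();
            // Turn off the built-in handling so every redirect is visible here.
            conn.setInstanceFollowRedirects(false);
            conn.setRequestProperty("User-Agent", userAgent);

            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            boolean redirect = status >= 300 && status < 400 && location != null;
            if (!redirect) {
                // Final response: hand its body to the caller (throws on 4xx/5xx).
                return conn.getInputStream();
            }
            conn.disconnect();
            current = resolve(current, location);
        }
        throw new IOException("Too many redirects starting from " + url);
    }
}
```

With this helper, the stream passed to `cleaner.clean(...)` could come from `RedirectCheck.open(url, userAgent, 5)` instead of `urlConnection.getInputStream()`. Calling `switchesProtocol` with the original URL and the `Location` value is a quick way to confirm that a protocol change is what produced the empty file.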