How can I download a complete webpage with Java without having "&nbsp;" replace parts of the HTML code?
I've been writing some code that goes to a website and copies the HTML code to a text file. The problem is that some of the code gets replaced with "&nbsp;". This is the code I'm using:
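import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
// z is assumed to be a java.util.Formatter field already opened on the output text file.
public void addRecords() throws IOException {
    URL google = new URL("Insert Website Here");
    // UTF-8 is an assumption; use the charset the page actually declares.
    // try-with-resources closes the reader even if reading fails.
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(google.openStream(), StandardCharsets.UTF_8))) {
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
            // %n writes the platform line separator without stray spaces
            z.format("%s%n", inputLine);
        }
    }
}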
Option 2
There are many HTML entities of the form &...; that browsers display as special characters. You can also use arbitrary numeric character codes, such as &#8233;.
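To make the numeric form concrete, here is a minimal sketch that decodes only numeric character references (decimal such as &#8233; and hexadecimal such as &#x2029;) using the standard library. The class name NumericEntityDecoder and its decode method are hypothetical names chosen for illustration; named entities such as &nbsp; are deliberately left untouched here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library: decodes numeric character references only.
final class NumericEntityDecoder {
    // Group 1 captures decimal digits (&#8233;), group 2 captures hex digits (&#x2029;).
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:([0-9]{1,7})|[xX]([0-9a-fA-F]{1,6}));");

    static String decode(String html) {
        Matcher m = NUMERIC_REF.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one unchanged.
            out.append(html, last, m.start());
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1))
                    : Integer.parseInt(m.group(2), 16);
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                // Leave references outside the Unicode range as they were.
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&#72;i &#x41;BC")); // prints "Hi ABC"
    }
}
```

A real page mixes numeric and named entities, so a sketch like this covers only part of the problem; a full unescape function handles both.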
There is an Apache library, Commons Lang, with similar unescape functions:
html = StringEscapeUtils.unescapeHtml4(html);
You can try something like this:
System.out.println(inputLine.replaceAll("&nbsp;", " "));
Note: your HTML page may contain other character escapes as well, so this solution does not generalize well.
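To illustrate why a single replacement does not scale, here is a rough sketch that extends the same idea to a handful of named entities. The class NamedEntityReplacer and the five-entry table are assumptions made for this example; HTML defines far more entities than this, which is the main reason to prefer a library function.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration: replaces a small, hand-picked set of named entities.
final class NamedEntityReplacer {
    // LinkedHashMap keeps insertion order, which matters for &amp; below.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&nbsp;", " ");   // plain space, as in the one-line replacement above
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&quot;", "\"");
        // &amp; goes last so that "&amp;lt;" becomes "&lt;" rather than "<".
        ENTITIES.put("&amp;", "&");
    }

    static String replaceEntities(String line) {
        String result = line;
        for (Map.Entry<String, String> entry : ENTITIES.entrySet()) {
            // String.replace does literal (non-regex) replacement of every occurrence.
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(replaceEntities("a&nbsp;b &lt;p&gt; &amp;lt;")); // prints "a b <p> &lt;"
    }
}
```

Any entity missing from the table passes through unchanged, so the table has to grow with every new page you process.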
You can refer to the Apache Commons Lang project, as discussed in this post: Replace HTML codes with equivalent characters in Java