How can I download a complete webpage with Java without having "&nbsp;" replace parts of the HTML code?

I've been writing some code that goes to a website and copies the HTML code to a text file. The problem is that some of the code gets replaced with "&nbsp;". This is the code I'm using:

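public void addRecords() throws IOException {

    // Placeholder: substitute the address of the page to download.
    URL google = new URL("Insert Website Here");
    BufferedReader in = new BufferedReader(
            new InputStreamReader(google.openStream()));

    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine);
        // z is not declared in this snippet; presumably a java.util.Formatter opened on the text file.
        z.format("%s \n ", (inputLine));
    }
    in.close();
}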

Option 1

  1. Read the web page into a contiguous buffer.
  2. Replace "&nbsp;" with " ".
  3. Write to the text file.
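
A minimal sketch of Option 1, under some assumptions that are not in the question: the page is UTF-8 encoded, the URL and output path arrive as command-line arguments, and the names WholePageCleaner, readAll and replaceNbsp are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WholePageCleaner {

    // Step 1: read everything from the reader into one String (the contiguous buffer).
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[8192];
        int n;
        while ((n = reader.read(chunk)) != -1) {
            sb.append(chunk, 0, n);
        }
        return sb.toString();
    }

    // Step 2: replace the literal text "&nbsp;" with an ordinary space.
    public static String replaceNbsp(String html) {
        return html.replace("&nbsp;", " ");
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is the page URL, args[1] is the output file path.
        try (Reader in = new InputStreamReader(
                new URL(args[0]).openStream(), StandardCharsets.UTF_8)) {
            String cleaned = replaceNbsp(readAll(in));
            // Step 3: write the cleaned text to the file.
            Files.write(Paths.get(args[1]), cleaned.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

String.replace is used rather than replaceAll because the search text is a literal, not a regular expression.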

Option 2 选项2

  1. Read the web page (as you are now).
  2. Get one line of the web page.
  3. Replace "&nbsp;" with " ".
  4. Write one line of the web page.
  5. If there are more lines, go to step 2.
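
Option 2 could look roughly like the sketch below. It keeps the readLine loop from the question and cleans each line before writing it. The class and method names (LineByLineCleaner, cleanLine, copyCleaned), the UTF-8 encoding and the use of command-line arguments are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Writer;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LineByLineCleaner {

    // Replaces the literal text "&nbsp;" in a single line with an ordinary space.
    public static String cleanLine(String line) {
        return line.replace("&nbsp;", " ");
    }

    // Reads one line at a time, cleans it, and writes it out until the input ends.
    public static void copyCleaned(BufferedReader in, Writer out) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            out.write(cleanLine(line));
            out.write(System.lineSeparator());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is the page URL, args[1] is the output file path.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                     new URL(args[0]).openStream(), StandardCharsets.UTF_8));
             BufferedWriter out = Files.newBufferedWriter(
                     Paths.get(args[1]), StandardCharsets.UTF_8)) {
            copyCleaned(in, out);
        }
    }
}
```

Because "&nbsp;" never contains a line break, replacing per line gives the same result as replacing over the whole page, while holding only one line in memory at a time.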

There are many HTML entities of the form &...; that browsers display as special characters. There are also numeric character references, which give the character code directly, for example &#8233;.
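
To illustrate the numeric form, here is a small sketch that decodes decimal references such as &#65; using only the standard library. It is not a full HTML unescaper: named entities like &nbsp; and hexadecimal references are left untouched, and the class name NumericEntityDecoder is made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericEntityDecoder {

    // Matches decimal references like &#160; (at most 7 digits, so the value fits in an int).
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,7});");

    // Replaces each decimal reference with the character it names;
    // references to invalid code points are left as they are.
    public static String decode(String text) {
        Matcher m = DECIMAL_REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("x&#65;y"));
    }
}
```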

The Apache Commons Lang library has similar unescape functions:

html = StringEscapeUtils.unescapeHtml4(html);

You can try something like this:

System.out.println(inputLine.replaceAll("&nbsp;", " "));

Note that your HTML page may contain other character escapes as well, so this solution does not generalize well.

You can refer to the Apache Commons Lang project, as described in this post: Replace HTML codes with equivalent characters in Java
