
How can I download a complete webpage with Java without having "&nbsp;" replace parts of the HTML code?

I've been writing some code that goes to a website and copies the HTML code to a text file. The problem is that some of the code gets replaced with "&nbsp;". This is the code I'm using:

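import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

// z is not declared in this snippet; it is presumably a java.util.Formatter
// (or similar) field that was opened on the text file elsewhere.
public void addRecords() throws IOException {

    URL google = new URL("Insert Website Here");
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(google.openStream()))) {

        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
            // %n writes the platform line separator without stray spaces
            z.format("%s%n", inputLine);
        }
    }
}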

Option 1

  1. Read the web page into a contiguous buffer.
  2. Replace "&nbsp;" with " ".
  3. Write the buffer to the text file.
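
The whole-buffer approach might look roughly like the sketch below. The class name WholePageCleaner and its method names are made up for illustration, UTF-8 is an assumption about the page's encoding, and InputStream.readAllBytes needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, not part of the original question.
class WholePageCleaner {

    // Replaces the literal entity text "&nbsp;" with an ordinary space.
    static String replaceNbsp(String html) {
        return html.replace("&nbsp;", " ");
    }

    // Reads the whole page into one String, cleans it, and writes it out.
    // UTF-8 is assumed here; the page may declare a different charset.
    static void download(String pageUrl, String outputFile) throws IOException {
        try (InputStream in = URI.create(pageUrl).toURL().openStream()) {
            String html = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            Files.write(Paths.get(outputFile),
                    replaceNbsp(html).getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

Holding the page in one String is simple, but the entire document sits in memory at once, which is usually fine for a single web page.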

Option 2

  1. Read the web page (as you are now).
  2. Get one line of the web page.
  3. Replace "&nbsp;" with " ".
  4. Write that line to the text file.
  5. If there are more lines, go to step 2.
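
A line-by-line version of the same idea could be sketched as follows. LineByLineCleaner and copyCleaned are illustrative names only; the method takes any Reader and Writer so it can be tried without a network connection.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper class, not part of the original question.
class LineByLineCleaner {

    // Copies text from in to out one line at a time,
    // replacing the literal text "&nbsp;" with an ordinary space.
    static void copyCleaned(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            out.write(line.replace("&nbsp;", " "));
            out.write(System.lineSeparator());
        }
        out.flush();
    }
}
```

In the question's setting, the Reader would be the InputStreamReader wrapped around openStream(), and the Writer could come from something like Files.newBufferedWriter for the output file.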

There are many HTML entities of the form &...; that the browser displays as special characters. There are also numeric character references, which give the character code directly, for example &#8233;.
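
To illustrate the numeric form, here is a minimal sketch that decodes decimal and hexadecimal character references using only the standard library. NumericEntityDecoder is a made-up name, and named entities such as &nbsp; are deliberately left untouched by it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; handles numeric references only.
class NumericEntityDecoder {

    // Matches decimal (&#8233;) and hexadecimal (&#x2029;) references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    static String decode(String html) {
        Matcher m = NUMERIC_REF.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement;
            try {
                int codePoint = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2), 10);
                // Leave the reference as-is if it is not a valid code point.
                replacement = Character.isValidCodePoint(codePoint)
                        ? new String(Character.toChars(codePoint))
                        : m.group();
            } catch (NumberFormatException e) {
                // Number too large for an int: keep the original text.
                replacement = m.group();
            }
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

Note that decoding &#160; this way yields the actual non-breaking space character (U+00A0), not an ordinary space, so a further replace may still be wanted if plain spaces are the goal.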

The Apache Commons Lang library provides unescape functions that handle these entities:

html = StringEscapeUtils.unescapeHtml4(html);

You can try something like this:

System.out.println(inputLine.replaceAll("&nbsp;"," "));

Note: your HTML page may contain other character escapes as well, so this solution does not generalize well.

For a more general solution, you can use the Apache Commons Lang project, as described in this post: Replace HTML codes with equivalent characters in Java
