How can I download a complete webpage with Java without having “&nbsp;” replace parts of the HTML code?

Question

I've been writing some code that goes to a website and copies the HTML code to a text file. The problem is that parts of the saved HTML show up as "&nbsp;". This is the code I'm using:

public void addRecords() throws IOException{

    URL google = new URL("Insert Website Here");
    BufferedReader in = new BufferedReader(
            new InputStreamReader(google.openStream()));

    String inputLine;
    while ((inputLine = in.readLine()) != null){
        System.out.println(inputLine);
        z.format("%s \n ", (inputLine));
    }
    in.close();
}

Answer 1

Option 1

1. Read the web page into a contiguous buffer.
2. Replace "&nbsp;" with " ".
3. Write to the text file.
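The whole-buffer steps above could be sketched roughly as follows. This is only an illustration: the class name `WholePageCopy` and its method names are made up for the example, the page is assumed to be UTF-8, and `readAllBytes`, `Files.writeString` and `Path.of` assume Java 11 or newer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of the original question.
class WholePageCopy {

    // Reads the entire stream into one String (assumes UTF-8 content).
    static String readAll(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }

    // Replaces every literal "&nbsp;" entity with an ordinary space.
    static String replaceNbsp(String html) {
        return html.replace("&nbsp;", " ");
    }

    // Downloads the page, replaces the entity, and writes the result to a file.
    static void save(String pageUrl, Path target) throws IOException {
        try (InputStream in = new URL(pageUrl).openStream()) {
            Files.writeString(target, replaceNbsp(readAll(in)), StandardCharsets.UTF_8);
        }
    }

    // Usage: java WholePageCopy <url> <output-file>
    public static void main(String[] args) throws IOException {
        save(args[0], Path.of(args[1]));
    }
}
```

`String.replace` is used instead of `replaceAll` because the search text is a literal, so no regular expression is needed.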

Option 2

1. Read the web page (as you are now).
2. Get one line of the web page.
3. Replace "&nbsp;" with " ".
4. Write one line of the web page.
5. If there are more lines, go back to step 2.
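The line-by-line variant might look like the sketch below. The class `LineByLineCopy` and its `copyLines` method are hypothetical names for this example; it takes a generic `Reader` and `Writer` so the same loop works for a URL stream, a file, or a string.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper class, not part of the original question.
class LineByLineCopy {

    // Copies source to target one line at a time, replacing "&nbsp;" with a space.
    // Each output line is terminated with '\n', whatever the input used.
    static void copyLines(Reader source, Writer target) throws IOException {
        BufferedReader in = new BufferedReader(source);
        String line;
        while ((line = in.readLine()) != null) {
            target.write(line.replace("&nbsp;", " "));
            target.write('\n');
        }
        target.flush();
    }
}
```

With the code from the question, the source would be `new InputStreamReader(google.openStream())` and the target any `Writer` opened on the text file. Since "&nbsp;" contains no line break, an entity is never split across two lines, so processing one line at a time is safe here.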

Answer 2

There are many HTML entities of the form &...; that the browser displays as special characters. There are also numeric character references, which give the character code directly: &#8233;.

The Apache Commons Lang library has unescape functions for these:

html = StringEscapeUtils.unescapeHtml4(html);
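If adding a library is not an option, the idea can be illustrated with a small hand-rolled decoder. This is a sketch and not the Commons Lang implementation: the class `EntityDecoder` is a made-up name, it knows only five named entities (the full HTML table is far larger), and `Map.of` and `Matcher.replaceAll` with a function assume Java 9 or newer. Newer Commons Lang releases also point to Apache Commons Text for the same `StringEscapeUtils` class.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; handles numeric references and a few named entities.
class EntityDecoder {

    // Matches &name; as well as decimal (&#8233;) and hex (&#x2029;) references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);");

    // Deliberately tiny table, for illustration only.
    private static final Map<String, String> NAMED = Map.of(
            "nbsp", "\u00A0",
            "amp", "&",
            "lt", "<",
            "gt", ">",
            "quot", "\"");

    // Decodes each entity once, in a single pass; unknown entities are kept as-is.
    static String unescape(String html) {
        return ENTITY.matcher(html).replaceAll(
                m -> Matcher.quoteReplacement(decode(m.group(1), m.group())));
    }

    private static String decode(String body, String original) {
        if (body.startsWith("#")) {
            boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
            try {
                int codePoint = hex
                        ? Integer.parseInt(body.substring(2), 16)
                        : Integer.parseInt(body.substring(1));
                if (Character.isValidCodePoint(codePoint)) {
                    return new String(Character.toChars(codePoint));
                }
            } catch (NumberFormatException e) {
                // Number too large to be a code point: fall through, keep the text.
            }
            return original;
        }
        String named = NAMED.get(body);
        return named != null ? named : original;
    }
}
```

Note that "&nbsp;" decodes to the non-breaking space U+00A0, not to an ordinary space; `unescapeHtml4` does the same. If a plain space is wanted in the text file, follow up with `html.replace('\u00A0', ' ')`.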

Answer 3

You can try something like this:

System.out.println(inputLine.replaceAll("&nbsp;"," "));

Note that your HTML page may contain other character escapes as well, so this solution is not very reusable.

You can also use the Apache Commons Lang project, as described in this post: Replace HTML codes with equivalent characters in Java
