I've been writing some code that fetches a web page and copies its HTML to a text file. The problem is that some of the text gets replaced with "&nbsp;". This is the code I'm using:
public void addRecords() throws IOException {
    // Placeholder URL; replace with the page you want to fetch.
    URL google = new URL("Insert Website Here");
    BufferedReader in = new BufferedReader(
            new InputStreamReader(google.openStream()));
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine);
        // z is declared elsewhere (presumably a Formatter writing to the text file).
        z.format("%s \n ", (inputLine));
    }
    in.close();
}
Option 2
There are many HTML entities of the form &...; that the browser displays as special characters. There are also numeric character references, which give the character code directly, for example &#8233;.
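To illustrate the numeric form, here is a minimal sketch that decodes decimal and hexadecimal character references using only the standard library. The class name NumericEntityDecoder and the method decode are made up for this example, and named entities such as &nbsp; are deliberately left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
public class NumericEntityDecoder {
    // Matches decimal (&#8233;) and hexadecimal (&#x2029;) character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    public static String decode(String html) {
        Matcher m = NUMERIC_REF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            try {
                int codePoint = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2), 10);
                // Keep the original text if the number is not a valid code point.
                replacement = Character.isValidCodePoint(codePoint)
                        ? new String(Character.toChars(codePoint))
                        : m.group();
            } catch (NumberFormatException e) {
                // Number too large to parse; leave the reference as-is.
                replacement = m.group();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&#72;&#x69;")); // prints "Hi"
    }
}
```

This only covers the numeric form; a full unescaper also needs the table of named entities, which is why a library is usually the better choice.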
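The Apache Commons Lang library provides unescape functions that handle these entities:
html = StringEscapeUtils.unescapeHtml4(html);
(Since Commons Lang 3.6 this class is deprecated in favor of the StringEscapeUtils class in Apache Commons Text, which offers the same method.)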
You can try something like this:
System.out.println(inputLine.replaceAll("&nbsp;", " "));
Note that your HTML page may contain other character escapes as well, so this solution does not generalize well.
For a more general approach, see the Apache Commons Lang project, as described in this post: Replace HTML codes with equivalent characters in Java
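If adding a library is not an option, the single replacement can be extended to a small lookup table. The following is only a sketch under that assumption: the class name NamedEntityReplacer, the method replaceEntities, and the choice of five entities are illustrative and far from the full HTML entity list. As in the one-line version, &nbsp; is mapped to a regular space rather than the non-breaking space U+00A0.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; covers only a handful of entities.
public class NamedEntityReplacer {
    // LinkedHashMap keeps insertion order, which matters for "&amp;" below.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&nbsp;", " ");
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&quot;", "\"");
        // Replaced last, so "&amp;lt;" becomes "&lt;" rather than "<".
        ENTITIES.put("&amp;", "&");
    }

    public static String replaceEntities(String line) {
        String result = line;
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            // String.replace does literal (non-regex) replacement.
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(replaceEntities("a&nbsp;b &lt;p&gt;")); // prints "a b <p>"
    }
}
```

In the loop from the question, this would be called on each inputLine before it is printed or written to the file.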