java.net.URL retrieves unintelligible stream

Question

I have been using Java code to retrieve the content of a URL. The code does not work for https://www.amazon.es/ , whereas similar Python code retrieves the content of the same Amazon URL without problems.

The Java code:

URL url = new URL(urlToScan);
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
StringBuilder builder = new StringBuilder();
for (String temp = reader.readLine(); temp != null; temp = reader.readLine())
    builder.append(temp);
webpage = builder.toString();

The Python code:

from urllib.request import urlopen
url = "https://www.amazon.es/"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)

I searched Amazon's HTML myself for the declared charset (in case it was a charset issue), and the page uses charset="utf-8" .

As the HTML is 22,000+ lines long, I thought it could be a parsing error with long Strings. I also tried reading into a ByteArrayOutputStream and then creating the String with the String(byte[], charset) constructor.
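For reference, the ByteArrayOutputStream attempt described above might look roughly like the sketch below. The class and method names (StreamDecoding, readAllBytes, readAsUtf8) are hypothetical, and a local ByteArrayInputStream stands in for url.openStream() so the example runs without network access.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class StreamDecoding {
    // Hypothetical helper: copies every byte of the stream into memory.
    static byte[] readAllBytes(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes the collected bytes in one step with an explicit charset.
    static String readAsUtf8(InputStream in) {
        return new String(readAllBytes(in), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A local stream stands in for url.openStream() so the sketch runs offline.
        byte[] sample = "<html>Espa\u00f1a</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAsUtf8(new ByteArrayInputStream(sample)));
    }
}
```

Decoding the whole byte array at once rules out line-splitting and String-length effects, so if the output is still unreadable, the bytes sent by the server are the likelier suspect.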

Java output:

Why is java.net.URL not retrieving the URL content properly?

Answer 1

The cause may be the User-Agent header. To set it, use URLConnection :

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

URL url = new URL("https://www.amazon.es/");
URLConnection connection = url.openConnection();
// Send a browser-like User-Agent instead of Java's default one.
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");

StringBuilder buffer = new StringBuilder();
// Decode explicitly as UTF-8 (the charset the page declares); try-with-resources closes the reader.
try (BufferedReader bufferedReader = new BufferedReader(
        new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
    String inputLine;
    while ((inputLine = bufferedReader.readLine()) != null) {
        buffer.append(inputLine).append("\n");
    }
}

System.out.println(buffer);

Python's urllib, by contrast, sends a User-Agent of its own by default (Python-urllib/<version>), which may be why the server answers it differently.
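If several URLs need the same treatment, the header setup could be pulled into a small helper, sketched below. The class name UserAgentRequest, the method open, and the constant are hypothetical names introduced for illustration; the User-Agent string is only an example value. The helper prepares the request without sending it, so the header can be checked before any network traffic occurs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

final class UserAgentRequest {
    // Example value only; any browser-like string could be substituted.
    static final String BROWSER_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";

    // Hypothetical helper: prepares (but does not yet send) a request with the given User-Agent.
    static URLConnection open(String address, String userAgent) {
        try {
            URLConnection connection = URI.create(address).toURL().openConnection();
            connection.setRequestProperty("User-Agent", userAgent);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection = open("https://www.amazon.es/", BROWSER_USER_AGENT);
        // No network traffic has happened yet; this only echoes the header that will be sent.
        System.out.println(connection.getRequestProperty("User-Agent"));
    }
}
```

Calling connection.getInputStream() on the returned object then sends the request, exactly as in the answer's snippet.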

1 answer

solution1
2 ACCEPTED 2021-02-23 17:43:34
