java.net.URL retrieves unintelligible stream

I have been using Java code to retrieve the content of a URL. The code does not work for https://www.amazon.es/ , although similar Python code retrieves the same page without trouble.

The java code:

URL url = new URL(urlToScan);
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
StringBuilder builder = new StringBuilder();
for (String temp = reader.readLine(); temp != null; temp = reader.readLine())
    builder.append(temp);
webpage = builder.toString();

The python code:

from urllib.request import urlopen
url = "https://www.amazon.es/"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)

I searched Amazon's HTML myself for the declared charset (in case this was a charset issue), and the page uses charset="utf-8" .

As the HTML is more than 22,000 lines long, I thought it might be a parsing error with long Strings, so I also tried reading into a ByteArrayOutputStream and then constructing the result with the String(byte[], charset) constructor.
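The byte-level attempt described above can be sketched roughly as follows. This is only an illustration: the class name ByteDump and the helpers readAll and hexPrefix are hypothetical names, not taken from the original code. Printing the first few bytes in hex is one way to check whether the response is text at all; for example, a body starting with 1f 8b would be a gzip stream rather than UTF-8 HTML, although whether that is what happens here is an assumption that would need to be verified.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class ByteDump {

    // Reads the whole stream into memory without any charset decoding.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    // Formats up to 'limit' leading bytes as space-separated hex pairs.
    static String hexPrefix(byte[] data, int limit) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(limit, data.length); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the URL from the question; pass another one as the first argument.
        String urlToScan = args.length > 0 ? args[0] : "https://www.amazon.es/";
        try (InputStream in = new URL(urlToScan).openStream()) {
            byte[] body = readAll(in);
            System.out.println("length: " + body.length);
            System.out.println("first bytes: " + hexPrefix(body, 16));
            System.out.println(new String(body, StandardCharsets.UTF_8));
        }
    }
}
```

If the hex prefix looks like ordinary HTML (for instance 3c 21 for "<!"), the problem is more likely in decoding or printing than in the transfer itself.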

Java output:

?

Why is java.net.URL not retrieving the URL content properly?

It may be because of the User-Agent header. To set the User-Agent , use URLConnection :

URL url = new URL("https://www.amazon.es/");
URLConnection connection = url.openConnection();
// Send a browser-like User-Agent instead of Java's default one
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");

StringBuilder buffer = new StringBuilder();
// Decode explicitly as UTF-8 rather than relying on the platform default charset;
// try-with-resources closes the reader even if reading fails
try (BufferedReader bufferedReader = new BufferedReader(
        new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
    String inputLine;
    while ((inputLine = bufferedReader.readLine()) != null) {
        buffer.append(inputLine).append("\n");
    }
}

System.out.println(buffer.toString());

Python's urllib, by contrast, sends a User-Agent of its own by default (of the form Python-urllib/3.x), which may be why the Python version works.
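To check whether the User-Agent really is the cause, it can help to look at the status code and the Content-Encoding header before reading the body. The following is a minimal sketch under stated assumptions, not a definitive fix: the class PageFetcher and its methods fetch and decodeBody are hypothetical names, and the gzip handling is included only in case the server declares a compressed body, which the question does not confirm.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

final class PageFetcher {

    // Decodes a response body, unwrapping gzip only if the server declared it.
    static String decodeBody(InputStream raw, String contentEncoding, Charset charset)
            throws IOException {
        InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(raw)
                : raw;
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Fetches a page with a browser-like User-Agent and logs basic response metadata.
    static String fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
        int status = conn.getResponseCode();
        System.err.println("HTTP " + status + ", Content-Encoding: " + conn.getContentEncoding());
        // For 4xx/5xx responses getInputStream() throws; the body is on the error stream.
        InputStream raw = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
        if (raw == null) {
            return "";
        }
        return decodeBody(raw, conn.getContentEncoding(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch(args.length > 0 ? args[0] : "https://www.amazon.es/"));
    }
}
```

Running it once with the setRequestProperty line and once without should show whether the status code or encoding differs between the two requests.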


 