I have been using Java code to retrieve the content of a URL. The code does not work for https://www.amazon.es/ . Similar Python code does retrieve the content of an Amazon URL.
The Java code:
URL url = new URL(urlToScan);
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
StringBuilder builder = new StringBuilder();
for (String temp = reader.readLine(); temp != null; temp = reader.readLine()) {
    builder.append(temp);
}
webpage = builder.toString();
The Python code:
from urllib.request import urlopen
url = "https://www.amazon.es/"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)
I searched Amazon's HTML myself for the charset in use (in case it was a charset issue), and the page declares charset="utf-8".
As the HTML is 22,000+ lines long, I thought it could be a parsing error with long Strings. I also tried reading into a ByteArrayOutputStream and then constructing the result with the String(byte[], charset) constructor.
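For reference, the ByteArrayOutputStream attempt described above could look roughly like this. It is a minimal sketch: the class name StreamToString and the method readAll are hypothetical, chosen only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

final class StreamToString {
    /** Reads the stream to the end and decodes all bytes at once with the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // read() returns -1 at end of stream
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return new String(out.toByteArray(), charset);
    }
}
```

It would be called as StreamToString.readAll(url.openStream(), StandardCharsets.UTF_8). Unlike the readLine() loop, this keeps the line breaks of the page, but it reads the same bytes from the server, so it cannot fix a response that is already wrong.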
Java output:
?
Why is java.net.URL not retrieving the URL content properly?
Maybe it's because of the User-Agent header. To set the User-Agent using URLConnection:
URL url = new URL("https://www.amazon.es/");
URLConnection connection = url.openConnection();
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
BufferedInputStream bufferedInputStream = new BufferedInputStream(connection.getInputStream());
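// Decode explicitly as UTF-8 instead of relying on the platform default charset
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(bufferedInputStream, StandardCharsets.UTF_8));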
StringBuilder buffer = new StringBuilder();
String inputLine;
while ((inputLine = bufferedReader.readLine()) != null) {
    buffer.append(inputLine).append("\n");
}
bufferedReader.close();
System.out.println(buffer.toString());
Python's urllib, by contrast, sends its own default User-Agent (Python-urllib/<version>), which the server may be treating differently from the one Java sends.
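To check whether the User-Agent really makes the difference, one could compare the status code and body size with and without the header. The following is a minimal sketch, assuming Java 9+ (for InputStream.readAllBytes); the class name UserAgentCheck and the method fetchSummary are hypothetical, not part of any library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

final class UserAgentCheck {
    /** Requests the address and summarizes status, Content-Encoding and body size. */
    static String fetchSummary(String address, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        if (userAgent != null) {
            conn.setRequestProperty("User-Agent", userAgent);
        }
        int status = conn.getResponseCode();
        String encoding = conn.getContentEncoding(); // null when the header is absent
        // getInputStream() throws for 4xx/5xx responses; the error body is on getErrorStream()
        try (InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream()) {
            int length = (in == null) ? 0 : in.readAllBytes().length;
            return "HTTP " + status
                    + ", Content-Encoding=" + (encoding == null ? "none" : encoding)
                    + ", " + length + " bytes";
        }
    }

    public static void main(String[] args) throws IOException {
        String address = args.length > 0 ? args[0] : "https://www.amazon.es/";
        // Passing null leaves the default User-Agent that Java sends on its own
        System.out.println("default UA: " + fetchSummary(address, null));
        System.out.println("browser UA: "
                + fetchSummary(address, "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"));
    }
}
```

If the two lines differ in status code, Content-Encoding or size, that would support the User-Agent explanation. If they are the same, the cause is probably elsewhere.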