HTTP Response content type different on HEAD request

I have written some simple code to get the content type of a given URL. To make processing faster, I changed the request method to HEAD:

// The URL below points to a random puppy picture.
// Opening it in a browser (or with Poster in Firefox, or Postman in Chrome)
// reports the content type as image/jpeg.

URL url = new URL("http://www.bubblews.com/assets/images/news/521013543_1385596410.jpg");    

HttpURLConnection connection = (HttpURLConnection) url
        .openConnection();
connection.setRequestMethod("HEAD");
connection.connect();
String contentType = connection.getContentType();
System.out.println(contentType);
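// getContentType() returns null when the header is missing or the request fails
if (contentType != null && !contentType.contains("text/html")) {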
    System.out.println("NOT TEXT/HTML");
    // Do something
}

I want to do something when the content type is not text/html, but when I set the request method to HEAD, the content type comes back as text/html. If I send the same HEAD request using Poster or Postman, I see the content type as image/jpeg.

So what makes the content type change in this Java code? Can someone please point out any mistake I may have made?

Note: I used this post as reference

You should probably add an Accept header and/or User-Agent header.

Many web servers deliver different content depending on the headers set by the client (e.g. a web browser, Java's HttpURLConnection, curl, ...). This is especially true for Accept, Accept-Encoding, Accept-Language, User-Agent, Cookie and Referer.

As an example, a web server might refuse to deliver an image if the Referer header does not point to an internal page. In your case, the web server apparently doesn't deliver images when the request looks like it comes from a robot. So if you make your request look as if it comes from a web browser, the server might deliver the image.
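A minimal sketch of that idea, assuming the server only looks at User-Agent and Accept (which I have not verified for this site). The class name, the helper names `prepareHead` and `isHtml`, the timeouts, and the concrete header values are placeholders chosen for illustration, not something the server is known to require:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;

public class HeadContentType {

    // Assumption: a browser-like User-Agent string picked only for illustration.
    static final String USER_AGENT =
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0 Safari/537.36";

    // Assumption: tell the server we prefer images but accept anything.
    static final String ACCEPT = "image/*,*/*;q=0.8";

    // Builds a HEAD request with browser-like headers. Nothing is sent yet;
    // the request goes out when the response is first read.
    static HttpURLConnection prepareHead(String url) {
        try {
            HttpURLConnection connection =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            connection.setRequestMethod("HEAD");
            connection.setRequestProperty("User-Agent", USER_AGENT);
            connection.setRequestProperty("Accept", ACCEPT);
            connection.setConnectTimeout(5000); // milliseconds
            connection.setReadTimeout(5000);    // milliseconds
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Null-safe check; also matches values such as "text/html; charset=UTF-8".
    static boolean isHtml(String contentType) {
        return contentType != null
                && contentType.toLowerCase(Locale.ROOT).startsWith("text/html");
    }

    public static void main(String[] args) {
        HttpURLConnection connection = prepareHead(
                "http://www.bubblews.com/assets/images/news/521013543_1385596410.jpg");
        try {
            // Reading a response header sends the request implicitly.
            String contentType = connection.getContentType();
            if (contentType == null) {
                System.out.println("No Content-Type (request failed or header missing)");
            } else if (!isHtml(contentType)) {
                System.out.println("NOT TEXT/HTML: " + contentType);
            } else {
                System.out.println(contentType);
            }
        } finally {
            connection.disconnect();
        }
    }
}
```

If the content type still comes back as text/html, comparing the full set of request headers that Postman sends with the ones set here is a reasonable next step.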

When crawling websites, you should respect robots.txt (because you are acting like a robot). So strictly speaking, you should be careful about faking the User-Agent if you make a lot of requests or build a big business on this. I don't know how big websites react to such behavior, especially when someone bypasses their business...

Please don't see this as a telling-off. I just wanted to point you to this, so you don't run into trouble. Maybe it's not a problem at all, YMMV.
