How to fetch only URLs with an HTML content type using jsoup

Question

I want to download only pages whose content type is "text/html" and skip files such as PDF, MP4, RAR and so on.

My current code is:

Connection connection = Jsoup.connect(linkInfo.getLink())
    .followRedirects(false)
    .validateTLSCertificates(false)
    .userAgent(USER_AGENT);

Document htmlDocument = connection.get();

if (!connection.response().contentType().contains("text/html")) {
    return;
}

Is there anything like:

Jsoup.connect(linkInfo.getLink()).contentTypeOnly("text/html");

Answer 1

If you need to know whether a resource is HTML before downloading it, you can send a HEAD request. A HEAD request returns only the response headers, so you can check that the content type is text/html before fetching the body. Your current approach does not really work because get() fetches and parses the response before your check runs, and jsoup throws an UnsupportedMimeTypeException for content types it does not handle (unless ignoreContentType(true) is set), so the check is never reached for files such as PDFs.

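// Ask for the headers only; no response body is downloaded.
Connection connection = Jsoup.connect(linkInfo.getLink())
    .method(Connection.Method.HEAD)
    .validateTLSCertificates(false)
    .followRedirects(false)
    .userAgent(USER_AGENT);

Connection.Response head = connection.execute();
// contentType() returns null when the server sends no Content-Type header.
String contentType = head.contentType();
if (contentType == null || !contentType.contains("text/html")) return;

// Jsoup.connect takes a String, so convert the URL returned by head.url().
Document html = Jsoup.connect(head.url().toString())
    .validateTLSCertificates(false)
    .followRedirects(false)
    .userAgent(USER_AGENT)
    .get();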
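
The content-type test in the snippet above can be pulled into a small helper. The following is only a sketch in plain Java with no jsoup dependency; the class name ContentTypeCheck and the method isHtml are hypothetical names chosen for illustration, not part of jsoup. It assumes you want an exact match on the media type, ignoring case and any parameters such as charset, and that a missing header should count as "not HTML".

```java
import java.util.Locale;

// Hypothetical helper, not part of jsoup: decides whether a Content-Type
// header value describes an HTML document.
final class ContentTypeCheck {

    static boolean isHtml(String contentType) {
        // A missing Content-Type header is treated as "not HTML" (assumption).
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8".
        int semicolon = contentType.indexOf(';');
        String mediaType = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        // Media types are case-insensitive, so normalise before comparing.
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("text/html");
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=UTF-8")); // true
        System.out.println(isHtml("application/pdf"));          // false
        System.out.println(isHtml(null));                       // false
    }
}
```

With such a helper, the check after the HEAD request could read `if (!ContentTypeCheck.isHtml(head.contentType())) return;`. If you also want to accept XHTML pages served as application/xhtml+xml, the comparison would need to be widened.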

solution1
2 ACCEPTED 2018-06-04 22:56:54