
How to fetch only URLs with an HTML content type using jsoup

I want to download only pages whose content type is "text/html" and skip PDF, MP4, RAR, and other non-HTML files.

My current code is:

 Connection connection = Jsoup.connect(linkInfo.getLink())
     .followRedirects(false)
     .validateTLSCertificates(false)
     .userAgent(USER_AGENT);

 Document htmlDocument = connection.get();

 if (!connection.response().contentType().contains("text/html")) {
     return;
 }

Is there anything like this?

Jsoup.connect(linkInfo.getLink()).contentTypeOnly("text/html");

If you need to know whether a resource is HTML before actually downloading it, you can send a HEAD request. A HEAD request returns only the response headers, so you can check that the Content-Type is text/html before fetching the body. Your current approach does not really work because get() downloads the body and tries to parse it before your check runs, and jsoup throws an exception (UnsupportedMimeTypeException) for unsupported content types such as PDF or MP4 unless ignoreContentType(true) is set.

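// Needs: org.jsoup.Connection, org.jsoup.Jsoup, org.jsoup.nodes.Document
// Note: validateTLSCertificates(...) was removed in later jsoup releases; drop it there.

// Step 1: HEAD request, which fetches only the headers.
Connection connection = Jsoup.connect(linkInfo.getLink())
    .method(Connection.Method.HEAD)
    .validateTLSCertificates(false)
    .followRedirects(false)
    .userAgent(USER_AGENT);

Connection.Response head = connection.execute();

// contentType() can be null when the server sends no Content-Type header.
String contentType = head.contentType();
if (contentType == null || !contentType.contains("text/html")) return;

// Step 2: the resource is HTML, so download and parse it.
// Jsoup.connect takes a String, so convert the java.net.URL first.
Document html = Jsoup.connect(head.url().toExternalForm())
    .validateTLSCertificates(false)
    .followRedirects(false)
    .userAgent(USER_AGENT)
    .get();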
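
The contains("text/html") check above works for typical headers, but it is case-sensitive and breaks on a missing header. Below is a minimal sketch of a stricter check that uses only the Java standard library. The class name ContentTypeCheck and the method isHtml are hypothetical names introduced for illustration; they are not part of jsoup. It assumes you only want the exact media type text/html, so application/xhtml+xml is treated as non-HTML here.

```java
import java.util.Locale;

public class ContentTypeCheck {

    /**
     * Returns true when a raw Content-Type header value names the text/html
     * media type. Parameters such as "; charset=UTF-8" are ignored and the
     * comparison is case-insensitive. (Hypothetical helper, not a jsoup API.)
     */
    public static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // header missing: treat the resource as non-HTML
        }
        // Keep only the media type, dropping any parameters after ';'.
        int semicolon = contentType.indexOf(';');
        String mediaType = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("text/html");
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=UTF-8")); // true
        System.out.println(isHtml("Text/HTML"));                // true
        System.out.println(isHtml("application/pdf"));          // false
        System.out.println(isHtml(null));                       // false
    }
}
```

With this helper, the guard after the HEAD request could be written as if (!ContentTypeCheck.isHtml(head.contentType())) return;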


 