
Is it possible to abort an HTTP request depending on the `Content-Type` when using Perl's LWP?

I want my script to download only text/html content, not binaries or images that could take significantly longer to download. I know about the max_size parameter, but I would like to add a check on the Content-Type header. Is this doable?

As others have pointed out, you can perform a HEAD request before your GET request. You ought to do this as a courtesy to the server: it is easy for you to abort the connection, but not necessarily easy for the web server to stop sending data and abandon the work it has already started on its end.

There are a few ways to do this, depending on how sophisticated you want to be.

  1. You can send an Accept header with your request that lists only text/html. A well-implemented HTTP server will return a 406 Not Acceptable status if the resource is of a type you said you don't accept. Of course, it might send it to you anyway. You can send this header with your HEAD request as well.

  2. With a recent version of LWP::UserAgent, you can use a handler subroutine to abort the request after the headers arrive and before the content body is read.

     use LWP::UserAgent;
     use Try::Tiny;

     my $ua = LWP::UserAgent->new;
     $ua->add_handler( response_header => sub {
         my($response, $ua, $h) = @_;
         die "Not HTML" unless $response->content_type eq 'text/html';
     });

     my $url = "http://example.com/foo";
     my $html;

     my $head_response = $ua->head($url, Accept => "text/html");
     if ($head_response->is_success) {
         my $get_response = $ua->get($url, Accept => "text/html");
         if ($get_response->is_success) {
             $html = $get_response->content;
         }
     }

See the Handlers section of the LWP::UserAgent documentation for details on handlers.

I haven't dealt with the handler's die here (LWP traps it rather than rethrowing it, so the response has to be inspected) or made sure to handle 406 responses carefully. I leave that as an exercise for the reader.
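One way to finish that exercise, as a rough sketch rather than a definitive implementation: when a response_header handler dies, LWP::UserAgent records the message in an X-Died header on the response instead of propagating the exception, and the status code stays whatever the server sent. The helper names abort_reason and fetch_html below are hypothetical, made up for illustration.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: the handler's die message if a handler aborted
# the request, or undef if the response was read normally.
sub abort_reason {
    my ($response) = @_;
    return scalar $response->header('X-Died');
}

# Hypothetical helper: return the body only when the server answered
# successfully and the handler did not abort the transfer.
sub fetch_html {
    my ($ua, $url) = @_;
    my $response = $ua->get($url, Accept => 'text/html');
    return unless $response->is_success;          # covers 406 and other errors
    return if defined abort_reason($response);    # handler died: not HTML
    return $response->decoded_content;
}

my $ua = LWP::UserAgent->new;
$ua->add_handler(response_header => sub {
    my ($response) = @_;
    # The trailing newline keeps Perl from appending "at ... line N".
    die "Not HTML\n" unless $response->content_type eq 'text/html';
});
```

Calling fetch_html($ua, "http://example.com/foo") would then yield the page text, or nothing when the server refused the request or the content was not HTML.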

You can use a HEAD request to query the URI's header info. If the server responds to HEAD requests, you'll get every header that a GET would have returned, without the body.

You can then decide what to do based on the MIME type.

Otherwise, you'll have to rely on the file's extension before you request it.
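A minimal sketch of that decision, assuming LWP::UserAgent and the URI module are available: trust the HEAD response when the server answers it, and fall back to a crude extension guess when it does not. Both helper names are hypothetical, and the extension list is an assumption you would adapt (dynamic pages ending in .php, for example, would need to be added).

```perl
use strict;
use warnings;
use LWP::UserAgent;
use URI;

# Hypothetical fallback: guess from the URL path alone.
# Paths without any extension are treated as possibly HTML.
sub path_suggests_html {
    my ($url) = @_;
    my $path = URI->new($url)->path;
    return 1 if $path eq '' || $path =~ m{/$};    # site root or directory
    return 1 if $path !~ m{\.[^./]+$};            # no extension at all
    return $path =~ m{\.(?:html?|xhtml)$}i ? 1 : 0;
}

# Hypothetical helper: decide whether a GET is worth issuing.
sub worth_fetching {
    my ($ua, $url) = @_;
    my $head = $ua->head($url);
    if ($head->is_success) {
        return $head->content_type eq 'text/html' ? 1 : 0;
    }
    # HEAD failed or is unsupported: fall back to the extension guess.
    return path_suggests_html($url);
}
```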

If you are using LWP::Simple, the minimal procedural interface to LWP, then the head function, called in list context, returns the content type as the first element of a list.

So you can write

use strict;
use warnings;

use LWP::Simple;

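for my $url ('http://www.bbc.co.uk') {
  # head returns the raw Content-Type header, which may carry
  # parameters such as "; charset=utf-8", and nothing at all on failure
  my ($ctype) = head $url;

  # Declare first: "my $x = ... if ..." has unreliable behaviour in Perl
  my $content;
  $content = get $url if defined $ctype && $ctype =~ m{^text/html\b}i;
}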


 