
Is it possible to abort an HTTP request depending on the `Content-Type` when using Perl's LWP?

I want my script to download only text/html content, not binaries or images that could take significantly longer to download. I know about the max_size parameter, but I would also like to add a check on the Content-Type header. Is this doable?

As others have pointed out, you can perform a HEAD request before your GET request. You ought to do this as a way of being polite to the server: it is easy for you to abort the connection, but not necessarily easy for the web server to stop sending a bunch of data and doing a bunch of work on its end.

There are a few ways to do this, depending on how sophisticated you want to be.

  1. You can send an Accept header with your request that lists only text/html. A well-implemented HTTP server will return a 406 Not Acceptable status if the resource is of a type you say you don't accept. Of course, the server might send it to you anyway. You can send this header with your HEAD request as well.

  2. When using a recent version of LWP::UserAgent, you can use a handler subroutine to abort the rest of the request after the headers have arrived and before the content body.

     use strict;
     use warnings;
     use LWP::UserAgent;

     my $ua = LWP::UserAgent->new;

     # Runs once the response headers have arrived, before any body data.
     # Dying here aborts the rest of the request.
     $ua->add_handler( response_header => sub {
         my ($response, $ua, $h) = @_;
         die "Not HTML" unless $response->content_type eq 'text/html';
     });

     my $url = "http://example.com/foo";
     my $html;

     # Ask with HEAD first, then fetch the body with GET.
     my $head_response = $ua->head($url, Accept => "text/html");
     if ($head_response->is_success) {
         my $get_response = $ua->get($url, Accept => "text/html");
         if ($get_response->is_success) {
             $html = $get_response->content;
         }
     }

See the Handlers section of the LWP::UserAgent documentation for details on handlers.

I haven't caught the exception thrown here, nor handled 406 responses carefully. I leave that as an exercise for the reader.
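One possible take on that exercise, as a rough sketch rather than a definitive answer. It assumes that LWP catches a `die` raised inside a handler and records the message in an `X-Died` response header instead of propagating it to the caller; check that against your LWP version. The `classify_response` helper and its return values are hypothetical names made up for illustration.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: returns 'html', 'not_acceptable', 'aborted',
# 'error' or 'other' for a given HTTP::Response object.
sub classify_response {
    my ($response) = @_;

    # The server honoured the Accept header and refused to send the resource.
    return 'not_acceptable' if $response->code == 406;

    # Assumption: a die inside a handler ends up in the X-Died header,
    # possibly on a response that still carries a 2xx status.
    return 'aborted' if defined $response->header('X-Died');

    return 'error' unless $response->is_success;

    # content_type() drops parameters such as "; charset=utf-8".
    return $response->content_type eq 'text/html' ? 'html' : 'other';
}
```

You would call `classify_response($get_response)` after the GET and only read `$get_response->content` when it returns `'html'`.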

You can use a HEAD request to query the URI's header info. If the server responds to HEAD requests, you'll get everything that a GET would have returned, except for that pesky body.

You can then decide what to do based on the MIME type.

Otherwise, you'll have to rely on the file's extension before you request it.
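The extension fallback could be sketched roughly like this. The `looks_like_html` helper and its list of extensions are assumptions made up for illustration, not anything standard, and a URL's extension is only a hint about what the server will actually send.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: guess from the URL path alone whether a resource
# is likely to be HTML. The extension list is an assumption.
sub looks_like_html {
    my ($url) = @_;
    my $path = URI->new($url)->path;

    # Last path segment; an empty one means a directory-style URL.
    my ($last) = $path =~ m{([^/]*)\z};
    return 1 if $last eq '';

    # No extension at all: cannot tell, so do not rule the URL out.
    my ($ext) = $last =~ /\.([^.]+)\z/;
    return 1 unless defined $ext;

    return lc($ext) =~ /\A(?:html?|xhtml|php|aspx?)\z/ ? 1 : 0;
}
```

With this, `looks_like_html('http://example.com/foo')` returns 1 because there is no extension to go on, while a URL ending in `.png` returns 0 and can be skipped without any request.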

If you are using LWP::Simple, the simplified procedural interface to LWP, then the head function, called in list context, returns the content type as the first element of a list.

So you can write

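use strict;
use warnings;

use LWP::Simple;

for my $url ('http://www.bbc.co.uk') {
  # head returns an empty list on failure, so $ctype may be undefined
  my ($ctype) = head $url;
  next unless defined $ctype;

  # The header value may carry parameters, e.g. "text/html; charset=utf-8"
  my $content;
  $content = get $url if $ctype =~ m{^text/html\b}i;
}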


 