
About Perl reading a webpage online via HTTP

I have a huge webpage, about 5 GB in size, and I would like to read its content directly (remotely) without downloading the whole file. I tried opening the HTTP URL with Perl's open file handler, but it fails with "No such file or directory". I also tried LWP::Simple, but fetching the whole content with get runs out of memory. Is there a way to open this content remotely and read it line by line? Thank you for your help.

You could try using LWP::UserAgent. The request method allows you to specify a CODE reference, which lets you process the data as it comes in.

#!/usr/bin/perl -w

use strict;
use warnings;

use LWP::UserAgent ();
use HTTP::Request ();

my $request = HTTP::Request->new(GET => 'http://www.example.com/');
my $ua = LWP::UserAgent->new();

$ua->request($request, sub {
        my ($chunk, $res) = @_;   # called once for each chunk as it arrives
        print $chunk;             # process the chunk instead of keeping it in memory
        return undef;
});

Technically the function should return the content instead of undef, but it seems to work if you return undef. According to the documentation:

The "content" function should return the content when called. “内容”功能在调用时应返回内容。 The content function will be invoked repeatedly until it return an empty string to signal that there is no more content. 内容函数将被重复调用,直到返回一个空字符串以表明没有更多内容。

I haven't tried this on a large file, and you would need to write your own code to handle the data coming in as arbitrarily sized chunks.
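Reassembling lines from arbitrarily sized chunks only needs a small buffer: keep the trailing partial line and emit everything that ends in a newline. Below is a minimal, network-free sketch of that logic; `handle_chunk` and the sample chunks are illustrative, not part of LWP.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $buffer = '';   # holds the trailing partial line between chunks
my @lines;

# Append a chunk to the buffer, then pull out every complete line.
sub handle_chunk {
    my ($chunk) = @_;
    $buffer .= $chunk;
    while ($buffer =~ s/\A([^\n]*\n)//) {
        push @lines, $1;          # one complete, newline-terminated line
    }
}

# Simulated chunks that split lines at arbitrary points.
handle_chunk($_) for ("first li", "ne\nsecond", " line\nthird", "\n");

print @lines;
# prints:
# first line
# second line
# third
```

Inside the request callback above, you would call something like `handle_chunk($chunk)` and process each completed line instead of pushing it onto an array.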

This Perl code will download a file from a URL, resuming the transfer if the file was already partially downloaded.

This code requires that the server return the file size (i.e. Content-Length) on a HEAD request, and that the server support byte ranges for the URL in question.

If you want some special processing for each chunk, just override it below:

use strict;
use warnings;
use LWP::UserAgent;
use List::Util qw(min);
use IO::Handle;                 # provides the flush method on filehandles

my $url  = "http://example.com/huge-file.bin";
my $file = "huge-file.bin";

DownloadUrl($url, $file);

sub DownloadUrl {
    my ($url, $file, $chunksize) = @_;
    $chunksize ||= 1024*1024;
    my $ua = LWP::UserAgent->new;
    my $res = $ua->head($url);
    my $size = $res->header('content-length');
    die "Cannot get size for $url" unless defined $size;
    open my $fh, '>>', $file or die "ERROR: $!";
    binmode $fh;
    for (;;) {
        $fh->flush;
        my $range1 = -s $fh;                          # resume from current size
        my $range2 = min($range1 + $chunksize, $size);
        last if $range1 == $range2;                   # numeric compare, not eq
        # Range end offsets are inclusive, hence the "- 1".
        $res = $ua->get($url, Range => "bytes=$range1-" . ($range2 - 1));
        last unless $res->is_success();
        # process next chunk:
        print $fh $res->content();
    }
    close $fh;
}
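One subtlety worth spelling out: the `bytes=a-b` form of the Range header is inclusive on both ends, so consecutive chunks must end at start + chunksize - 1 or every request re-fetches one overlapping byte. A small sketch of the arithmetic (`chunk_ranges` is a hypothetical helper, not part of LWP):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);

# Split a file of $size bytes into non-overlapping, inclusive byte ranges.
sub chunk_ranges {
    my ($size, $chunksize) = @_;
    my @ranges;
    for (my $start = 0; $start < $size; $start += $chunksize) {
        my $end = min($start + $chunksize, $size) - 1;   # inclusive end offset
        push @ranges, "bytes=$start-$end";
    }
    return @ranges;
}

print "$_\n" for chunk_ranges(10, 4);
# prints:
# bytes=0-3
# bytes=4-7
# bytes=8-9
```

Note that the ranges abut rather than overlap: the next chunk starts at the previous end offset plus one, which is exactly what resuming from `-s $fh` gives you.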

Note: posts on this site follow the CC BY-SA 4.0 license; if you reproduce them, please credit this site or the original source.
