
improving LWP::Simple perl performance

Alas, I have yet another question:

I have been tasked with reading a webpage and extracting links from that page (easy stuff with HTML::TokeParser). He (my boss) then insists that I read from these links and grab some details from each of those pages, and parse ALL of that information into an XML file, which can later be read.

So, I can set this up fairly simply like so:

#!/usr/bin/perl -w

use     strict;
use     LWP::Simple; 
require HTML::TokeParser; 

$|=1;                        # unbuffer output

my $base = 'http://www.something_interesting/';
my $path = 'http://www.something_interesting/Default.aspx';
my $rawHTML = get($path) or die "Couldn't fetch $path"; # download the page into memory

my $p = HTML::TokeParser->new(\$rawHTML) || die "Can't open: $!";

open (my $out, '>', 'output.xml') or die "Can't open output.xml: $!";

while (my $token = $p->get_tag("a")) {

    my $url = $token->[1]{href} || "-";

    if ($url =~ /event\.aspx\?eventid=(\d+)/) {
        ( my $event_id = $url ) =~ s/event\.aspx\?eventid=(\d+)/$1/;
        my $text = $p->get_trimmed_text("/a");
        print $out $event_id,"\n";
        print $out $text,"\n";

        my $details = $base.$url;
        my $contents = get($details);

        # now set up another HTML::TokeParser, and parse each of those files.

    }
}

This would probably be OK if there were maybe 5 links on this page. However, I'm trying to read from ~600 links and grab info from each of those pages. So, needless to say, my method is taking a LONG time... I honestly don't know how long, since I've never let it finish.

It was my idea to simply write something that only gets the information as needed (e.g., a Java app that looks up the information from the link that you want)... however, this doesn't seem to be acceptable, so I'm turning to you guys :)

Is there any way to improve on this process?

You will probably see a speed boost -- at the expense of less simple code -- if you execute your get()s in parallel instead of sequentially.

Parallel::ForkManager is where I would start (it even includes an LWP::Simple get() example in its documentation), but there are plenty of other alternatives to be found on CPAN, including the fairly dated LWP::Parallel::UserAgent.
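
A rough sketch of that approach, assuming the ~600 detail URLs have already been collected into @links and that ten concurrent downloads is a sensible cap:

use strict;
use warnings;
use LWP::Simple qw(get);
use Parallel::ForkManager;

my @links = ();                             # the event URLs gathered from Default.aspx
my $pm = Parallel::ForkManager->new(10);    # run at most 10 child processes at a time

for my $url (@links) {
    $pm->start and next;      # parent: fork a child and move on to the next URL
    my $html = get($url);     # child: fetch one detail page
    # ... parse $html and emit its chunk of the XML here ...
    $pm->finish;              # child: exit
}
$pm->wait_all_children;

Each child is a separate process, so it cannot simply push results into a variable in the parent; either have each child write its own output (merged afterwards), or pass a data structure back through $pm->finish and a run_on_finish callback, which recent versions of the module support.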

If you want to fetch more than one item from a server and do so speedily, use HTTP keep-alive (persistent connections). Drop the simplistic LWP::Simple and use the regular LWP::UserAgent with the keep_alive option. That will set up a connection cache, so you will not incur the TCP connection set-up overhead when fetching more pages from the same host.

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common;

my @urls = @ARGV or die 'URLs!';
my %opts = ( keep_alive => 10 ); # cache 10 connections
my $ua = LWP::UserAgent->new( %opts );
for ( @urls ) {
        my $req = HEAD $_;
        print $req->as_string;
        my $rsp = $ua->request( $req );
        print $rsp->as_string;
}

my $cache = $ua->conn_cache;
my @conns = $cache->get_connections;
# has methods of Net::HTTP, IO::Socket::INET, IO::Socket

WWW::Mechanize is a great piece of work to start with, and if you are looking at modules, I'd also suggest Web::Scraper.

Both have docs at the links I provided and should help you get going quickly.
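
For example, a minimal WWW::Mechanize version of the link-gathering step might look like this (the eventid pattern and listing URL are taken from the question's code):

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );   # die automatically on failed requests
$mech->get('http://www.something_interesting/Default.aspx');

# every link whose URL matches the event pattern from the question
my @links = $mech->find_all_links( url_regex => qr/event\.aspx\?eventid=\d+/ );

for my $link (@links) {
    my ($event_id) = $link->url =~ /eventid=(\d+)/;
    print "$event_id\t", $link->text, "\n";
    # $mech->get( $link->url_abs ) would fetch the detail page next
}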

Your issue is scraping being more CPU-intensive than I/O-intensive. While most people here would suggest you use more CPU, I'll try to show a great advantage of Perl being used as a "glue" language. Everyone agrees that libxml2 is an excellent XML/HTML parser. Also, libcurl is an awesome download agent. However, in the Perl universe, many scrapers are based on LWP::UserAgent and HTML::TreeBuilder::XPath (which is similar to HTML::TokeParser, while being XPath-compliant). In that case, you can use drop-in replacement modules to handle downloads and HTML parsing via libcurl/libxml2:

use LWP::Protocol::Net::Curl;
use HTML::TreeBuilder::LibXML;
HTML::TreeBuilder::LibXML->replace_original();

I saw an average 5x speed increase just by prepending these 3 lines in several scrapers I used to maintain. But, as you're using HTML::TokeParser, I'd recommend trying Web::Scraper::LibXML instead (plus LWP::Protocol::Net::Curl, which affects both LWP::Simple and Web::Scraper).
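
For illustration, here is a minimal sketch written against the stock Web::Scraper interface (the assumption being that Web::Scraper::LibXML accepts the same scraper/process calls) that pulls the event links out of the listing page:

use strict;
use warnings;
use URI;
use Web::Scraper;                # or Web::Scraper::LibXML for the libxml2 backend
# use LWP::Protocol::Net::Curl;  # optional: route the download through libcurl

my $events = scraper {
    # every <a> whose href points at an event page, keeping its URL and text
    process '//a[contains(@href, "eventid")]',
        'events[]' => { url => '@href', title => 'TEXT' };
};

my $res = $events->scrape( URI->new('http://www.something_interesting/Default.aspx') );

for my $event (@{ $res->{events} || [] }) {
    print "$event->{url}\t$event->{title}\n";
}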

There's a good chance it's blocking on the HTTP get request while it waits for the response from the network. Use an asynchronous HTTP library and see if it helps.
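
One minimal sketch of that idea, assuming AnyEvent::HTTP as the asynchronous library, starts every request up front and waits until the last response arrives before parsing:

use strict;
use warnings;
use AnyEvent;
use AnyEvent::HTTP;

my @urls = @ARGV;
my %body_for;

my $done = AnyEvent->condvar;
$done->begin;                     # hold the condvar open while requests are being queued
for my $url (@urls) {
    $done->begin;
    http_get $url, sub {
        my ($body, $headers) = @_;
        $body_for{$url} = $body if $headers->{Status} =~ /^2/;
        $done->end;               # one outstanding request finished
    };
}
$done->end;
$done->recv;                      # block until every response has arrived

# parse $body_for{$_} for each page here

The thread-based worker pool below reaches the same goal by a different route, overlapping the network waits across a fixed number of worker threads.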

use strict;
use warnings;

use threads;  # or: use forks;

use Thread::Queue qw( );

use constant MAX_WORKERS => 10;
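
# Placeholders so the sketch runs on its own; swap in the real list of event
# URLs and the real fetch/parse and output logic. (These names are assumptions,
# not part of the original answer.)
my @urls = @ARGV;

sub process_request {
   my ($url) = @_;
   # fetch and parse the page here; return whatever the main thread should record
   return $url;
}

sub process_response {
   my ($item) = @_;
   print "$item\n";
}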

my $request_q  = Thread::Queue->new();
my $response_q = Thread::Queue->new();

# Create the workers.
my @workers;
for (1..MAX_WORKERS) {
   push @workers, async {
      while (my $url = $request_q->dequeue()) {
         $response_q->enqueue(process_request($url));
      }
   };
}

# Submit work to workers.
$request_q->enqueue(@urls);

# Signal the workers they are done.    
for (1..@workers) {
   $request_q->enqueue(undef);
}

# Wait for the workers to finish.
$_->join() for @workers;

# Collect the results.
while (defined( my $item = $response_q->dequeue_nb() )) {  # non-blocking: stop once the queue is empty
   process_response($item);
}
