
How to grep string with regex in Perl?

I am new to Perl and I want to write a simple script that fetches a web page with LWP::Simple's get() and then greps the result for a regex match. Here is my code:

$content = get("http://pl.wikipedia.org/wiki/$arg1");
my $result = grep(/en\.wikipedia\.org\/wiki\/[A-Za-z]+\"\s*title/, $content);
print $result;

When I print the result, it is "1". How can I get the string that matched: 'en.wikipedia.org/wiki/TextIWantToGet" title'?

Thanks in advance!

Here is what I would do, starting from your code:

use strict; use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $arg1 = "Rower";

# Create a user agent object
my $ua = LWP::UserAgent->new;

# Create a request
my $req = HTTP::Request->new(GET => "http://pl.wikipedia.org/wiki/$arg1");

# Pass request to the user agent and get a response back
my $res = $ua->request($req);

# Check the outcome of the response
die $res->status_line, "\n" unless $res->is_success;

my $content = $res->content;

# Only use $1 if the match succeeded
if ( $content =~ /en\.wikipedia\.org\/wiki\/([A-Za-z]+)\"\s*title/ ) {
    print "$1\n";
}
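
If the page may contain more than one such link, the same pattern can be applied with the /g modifier in list context, which returns every captured group. A minimal sketch, assuming a hard-coded sample string ($html is a hypothetical stand-in for the fetched content, so the example runs without network access):

```perl
use strict; use warnings;

# Hypothetical sample standing in for the fetched page content
my $html = <<'HTML';
<a href="//en.wikipedia.org/wiki/Bicycle" title="Bicycle">English</a>
<a href="//en.wikipedia.org/wiki/Tricycle" title="Tricycle">English</a>
HTML

# With /g in list context, the match returns all captured groups
my @names = $html =~ m{en\.wikipedia\.org/wiki/([A-Za-z]+)"\s*title}g;

print "$_\n" for @names;
```

Using m{...} as the delimiter avoids having to escape the slashes in the URL.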

But parsing HTML with regexes is discouraged. Instead, go a step further and learn how to use HTML::TreeBuilder::XPath:

use strict; use warnings;
use HTML::TreeBuilder::XPath;
use LWP::UserAgent;
use HTTP::Request;

my $arg1 = "Rower";

# Create a user agent object
my $ua = LWP::UserAgent->new;

# Create a request
my $req = HTTP::Request->new(GET => "http://pl.wikipedia.org/wiki/$arg1");

# Pass request to the user agent and get a response back
my $res = $ua->request($req);

# Check the outcome of the response
die $res->status_line, "\n" unless $res->is_success;

my $tree = HTML::TreeBuilder::XPath->new_from_content( $res->content );

# Using XPath, select the 'href' attribute of links that have a 'title'
# attribute and whose 'href' contains 'en.wikipedia.org'
my $link = $tree->findvalue(
    '//a[@title]/@href[contains(., "en.wikipedia.org")]'
);
$link =~ s!.*/!!;
print "$link\n";

Just for fun, here is a more concise version using WWW::Mechanize:

use strict; use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder::XPath;

my $m = WWW::Mechanize->new( autocheck => 1 );
$m->get("http://pl.wikipedia.org/wiki/$ARGV[0]");
my $tree = HTML::TreeBuilder::XPath->new_from_content( $m->content );

print join "\n", map { s!.*/!!; $_ } $tree->findvalues(
    '//a[@title]/@href[contains(., "en.wikipedia.org")]'
);

You need to wrap $result in parentheses to force list context instead of scalar context. The Perl documentation for grep says

"Evaluates the BLOCK or EXPR for each element of LIST (locally setting $_ to each element) and returns the list value consisting of those elements for which the expression evaluated to true. In scalar context, returns the number of times the expression was true."

So you need to use something like

my ($result) = grep(/en\.wikipedia\.org\/wiki\/([A-Za-z]+)\"\s*title/, $content);

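However, it really depends on which part of the HTML you are actually interested in: the end of the URL, or the title of the page?

Note that grep returns the list elements that matched, not the captured text, so with a single-element list $result ends up holding the whole page content. To get only the part after /wiki/ (upper- or lowercase A-Z), apply the match operator in list context instead: my ($result) = $content =~ /en\.wikipedia\.org\/wiki\/([A-Za-z]+)\"\s*title/;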
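
The difference between these contexts can be seen on a small example. This is only a sketch: $content here is a hypothetical hard-coded string rather than a fetched page, and the variable names are made up for illustration.

```perl
use strict; use warnings;

# Hypothetical stand-in for the page content
my $content = '<a href="//en.wikipedia.org/wiki/Bicycle" title="Bicycle">';
my $re      = qr{en\.wikipedia\.org/wiki/([A-Za-z]+)"\s*title};

# grep in scalar context: the number of list elements that matched
my $count = grep { /$re/ } $content;

# grep in list context: the matching elements themselves (here, the whole string)
my ($element) = grep { /$re/ } $content;

# Match operator in list context: the captured groups
my ($name) = $content =~ $re;

print "$count\n";
print "$element\n";
print "$name\n";
```

Only the last form yields the captured text; both grep forms operate on whole list elements.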
