
Perl: get multiple lines from a text file between patterns

I have an HTML file containing data that I have to push to a MySQL database. I parse the file to pull the values I need into scalars, and that part works. The problem comes when I need to collect data not from a single line of text but from multiple lines between certain patterns. Here is what I have so far, which mostly works:

  #!/usr/bin/perl
  use strict;
  use warnings;
  binmode STDOUT, ':encoding(cp1250)';

  my $file = 'index.html';
  open my $fh, '<', $file or die "Could not open $file: $!";
  my $word;
  my $description;
  my $origin;

  while (my $line = <$fh>)
  {
    # Capture the text inside <h2 class="featured">...</h2>
    if ($line =~ m/<h2 class="featured">(.*)<\/h2>/)
    {
      $word = $1;
    }

    # Print the related-posts line and capture its link text
    if ($line =~ m/<h4 class="related-posts">/)
    {
      print $line;
      if ($line =~ m/<h4 class="related-posts"> <a href="\.\.\/tag\/lacina\/index\.html" rel="tag">(.*)<\/a><\/h4>/)
      {
        $origin = $1;
      }
    }
  }
  close $fh;

  print "$word \n";
  print "$origin";

Now I want to grab a few lines of text. They do not have to end up in a single scalar, but I don't know how many lines there will be. All I know is that the lines sit between these two tags:

<div class="post-content">

<p>text I want</p>
<p>1.text I want</p>
<p>2.text I want</p>

<div class="box small arial">

Plus I would like to get rid of the <p> tags.

I thought of reading a line, storing it in a scalar, then reading the next line and comparing it to the saved scalar. But how am I supposed to check whether I have everything I want in that scalar?

Use the range operator to find the text between two patterns:

use strict;
use warnings;

while (<DATA>) {
    if (my $range = /<div class="post-content">/ .. /<div class="box small arial">/) {
        next if $range =~ /E/;
        print;
    }
}

__DATA__
<html>
<head><title>stuff</title></head>
<body>
<div class="post-content">
<p>text I want</p>
<p>1.text I want</p>
<p>2.text I want</p>
</div>
<div class="box small arial">
</div>
</body>
</html>

Outputs:

<div class="post-content">
<p>text I want</p>
<p>1.text I want</p>
<p>2.text I want</p>
</div>
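
The output above still contains the opening marker line and the <p> tags. In scalar context the range operator returns a sequence number that starts at 1 on the opening line and has "E0" appended on the closing line, so both marker lines can be skipped. The following is a minimal sketch of that idea, not a definitive implementation; the helper name `lines_between` and the assumption that each paragraph sits on its own line are mine, not the asker's:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text of each <p>...</p> line found
# strictly between the two marker lines.
sub lines_between {
    my @lines = @_;
    my @wanted;
    for (@lines) {
        # Flip-flop: true from the opening marker through the closing marker.
        my $range = /<div class="post-content">/ .. /<div class="box small arial">/;
        next unless $range;                      # outside the block
        next if $range == 1 || $range =~ /E/;    # the marker lines themselves
        # Keep only the text inside <p>...</p>; other lines (e.g. </div>) are ignored.
        push @wanted, $1 if m{<p>(.*?)</p>};
    }
    return @wanted;
}

my $html = <<'HTML';
<body>
<div class="post-content">
<p>text I want</p>
<p>1.text I want</p>
<p>2.text I want</p>
</div>
<div class="box small arial">
</div>
</body>
HTML

my @paragraphs = lines_between(split /\n/, $html);
print "$_\n" for @paragraphs;
```

Collecting the lines into an array avoids having to know in advance how many there will be. Note that the flip-flop keeps its state between calls, so this sketch assumes the closing marker is always present in the input.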

However, the real answer is to use an actual HTML parser for parsing HTML.

I'd recommend Mojo::DOM. For a helpful 8-minute introductory video, check out Mojocast Episode 5.

use strict;
use warnings;

use Mojo::DOM;

my $data = do {local $/; <DATA>};

my $dom = Mojo::DOM->new($data);

for my $div ($dom->find('div[class=post-content]')->each) {
    print $div->all_text();
}

__DATA__
<html>
<head><title>stuff</title></head>
<body>
<div class="post-content">
<p>text I want</p>
<p>1.text I want</p>
<p>2.text I want</p>
</div>
<div class="box small arial">
</div>
</body>
</html>

Outputs:

text I want 1.text I want 2.text I want
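
If each paragraph is needed as a separate value rather than one joined string, the selector can target the <p> elements directly. This is a rough sketch under the assumption that Mojolicious is installed and that the markup looks like the sample; the variable names are illustrative only:

```perl
use strict;
use warnings;
use Mojo::DOM;

my $html = <<'HTML';
<div class="post-content">
<p>text I want</p>
<p>1.text I want</p>
<p>2.text I want</p>
</div>
<div class="box small arial">
</div>
HTML

my $dom = Mojo::DOM->new($html);

# "div.post-content p" selects every <p> inside the post-content div;
# map('text') calls ->text on each matched element.
my @paragraphs = $dom->find('div.post-content p')->map('text')->each;

print "$_\n" for @paragraphs;
```

Each element of `@paragraphs` then holds the text of one paragraph with the tags already stripped.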

Use a tool built for the job, such as HTML::TreeBuilder, instead of a regular expression.

use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;

my $tr = HTML::TreeBuilder->new_from_file('index.html');

for my $div ($tr->look_down(_tag => 'div', 'class' => 'post-content')) {
  for my $t ($div->look_down(_tag => 'p')) {
    say $t->as_text;
  }
}

Output:

text I want
1.text I want
2.text I want
