简体   繁体   中英

Finding snippets of occurrences of key words in text using Perl

I have text files, say about 200 lines of 15-20 words each. Somewhere in the text there may be "key words" that I'm interested in. I'm trying to use a regular expression in Perl to find these key words, as well as some of the surrounding words (to provide context) and print all the results. (The concept is pretty much the same thing as what Google does when it shows a "snippet" of search results with the words in context.)

The challenge for me is that sometimes the key words appear close together, and I can't figure out how to get the regex to deal with this. I've tried using various combinations of negative lookahead, but so far it's not working right.

For example, suppose the key words are red and green the text file had a bunch of lines with no key words, and then something like this:

Here is some text. One color is red, another is green. Green and red are both colors that are in rainbows but red is at the top and blue is near the bottom.

Assuming I want to keep three words on each side of a "hit" for context, this should give these two snippets:

1) One color is red, another is green. Green and red are both colors

2) in rainbows but red is at the

Any ideas on how to do this?

Here is something that might help you along. Instead of being worried about the amount of words between other words, it's much easier to just work with the number of characters. Similarly, the extra "words" on each side are easier just treated as characters which you later turn into words.

use strict;
use warnings;

my $data = do {local $/; <DATA>};
my @words = qw(red green);

my $words_re = '\b(?:' . join('|', map quotemeta, @words) . ')\b';

while ($data =~ m{
    (.{0,20})   # Prefix
    ($words_re (?:.{1,20} $words_re)*)   # Keyword match, with another keyword w/i 20 characters
    (.{0,20})   # Postfix
}xg) {
    my ($prefix, $match, $postfix) = ($1, $2, $3);

    # Reduce prefix and postfix to just 3 words
    $prefix = reverse $prefix;
    for ($prefix, $postfix) {
        s/^(\S*(?:\s+\S+){0,3}).*/$1/;
    }
    $prefix = reverse $prefix;

    print "$prefix$match$postfix\n";
}

__DATA__
Here is some text. One color is red, another is green. Green and red are both colors that are in rainbows but red is at the top and blue is near the bottom.

Outputs:

One color is red, another is green. Green and red are both colors
in rainbows but red is at the

You may need to play around with the amount characters, but this approach should help you along.

Hmmm, okay, well I tried to get it to work and came up with this. I do not know if this is the best way to go about this, but it seems to work on your sample string at least.

/((\w+\W+){1,3}(red|green)(\W+\w+){1,3})/ig

What I was trying to do with this was to grab a word, followed by a non-word (space, period, comma, etc.) 1-3 times. Then look for the actual words in the expression (red or green). Finally, look for a non-word character followed by a word character 1-3 times.

For the example string, that gives me the following matches at positions:

  • $var[0][1] One·color·is·red,·another·is·green
  • $var[1][1] Green·and·red·are·both·colors
  • $var[2][1] in·rainbows·but·red·is·at·the

So, although this gives me the expected results for red in both instances, green is a little sketchy. The matches as a whole gave my what I wanted, but I'm not so sure if it really is working as expected when looking at it in terms of individual items. I'd be happy to play around with it some more if you have a more data to work with.

Here is a demo for you to review

You can try this code:

#!/usr/bin/perl

use strict;
use warnings;

my $txt = 'Here is some text. One color is red, another is green.'
        . ' Green and red are both colors that are in rainbows but'
        . ' red is at the top and blue is near the bottom.';

while ($txt =~ /(
    (?:                 # words before
        (?!(red|green)) # not followed by the keywords, group 2 is defined
        \b \w+          # a word boundary is needed to not truncate a keyword
        ([\s\pP]+)      # one or more spaces or punct symbols, group 3 is defined
    ){0,3}              # zero to three times (keywords can be at the begining)
    (?2)                # refers to the subpattern in group 2 (keywords)
    (?:                 # if other keywords are met before the 4th word
        (?: (?3) \w+ ){0,2} (?3) (?2)
    )* 
    (?: (?3) \w+ ){0,3} # zero to three words after
               )/gix) {
    print $1 . "\n";
}

In this example, keywords are hardcoded in the pattern, but you can put keywords in an array and join them like in Miller's example. You can use a variable for the number of words too.

The Search::Tools CPAN module is designed to do exactly this kind of snippeting.

Here's an example:

#!perl

use Search::Tools;
use 5.10.0;

my $full_text = "Here is some text. One color is red, another is green. Green and red are both colors that are in rainbows but red is at the top and blue is near the bottom. The color red is Bob's favorite";

my $snipper = Search::Tools->snipper( query         => 'red green',
                                      context       => 10,
                                      ignore_length => 1 );

say $snipper->snip($full_text);

Which returns:

... color is red, another is green. Green and red are both colors ... in rainbows but red is at the top and ... near the bottom. The color red is Bob's favorite ...

The library's clearly documented, and handles edge cases well.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM