
Exclude words that may or may not end with a slash

I am trying to exclude certain words from a dictionary file.

# cat en.txt
test
testing
access/p
batch
batch/n
batches
cross

# cat exclude.txt
test
batch

# grep -vf exclude.txt en.txt
access/p
cross

Words like "testing" and "batches" should still be included in the results.

expected result:
testing
access/p
batches
cross

The word "batch" may or may not be followed by a slash "/", and there can be one or more tags after the slash ("n" in this case). The word "batches", however, is a different word and should not match "batch".

I would use GNU AWK for this task in the following way. Let the content of en.txt be

test
testing
access/p
batch
batch/n
batches
cross

and the content of exclude.txt be

test
batch

then

awk 'BEGIN{FS="/"}FNR==NR{arr[$1];next}!($1 in arr)' exclude.txt en.txt

gives output

testing
access/p
batches
cross

Explanation: I tell GNU AWK that / is the field separator (FS). While the first file is being processed (that is, while the global record number equals the per-file record number, FNR==NR), I simply use the value of the 1st field as a key in the array arr and move on to the next line, so nothing else happens. For the 2nd file (and any following files), I select the lines whose 1st field is not (!) one of the keys of arr.

(tested in GNU Awk 5.0.1)
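
To see why FNR==NR singles out the first file, it can help to print both counters next to each line. This is only a minimal sketch: the scratch directory from mktemp and the shortened en.txt are assumptions made for illustration, not part of the question.

```shell
# Scratch copies of the two files (en.txt is shortened for brevity).
tmpdir=$(mktemp -d)
printf '%s\n' test batch > "$tmpdir/exclude.txt"
printf '%s\n' test testing access/p > "$tmpdir/en.txt"

# Print the global record number (NR), the per-file one (FNR) and the line.
counters=$(awk '{ print NR, FNR, $0 }' "$tmpdir/exclude.txt" "$tmpdir/en.txt")
printf '%s\n' "$counters"
```

The two numbers agree only on the first two lines, which come from exclude.txt. From the third line on, NR keeps counting while FNR restarts at 1 for en.txt.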

Since many words in a dictionary may share a root with one of the words to exclude, we cannot conveniently use a look-up hash (built from the exclude list) but have to check each entry against all of them. One way to do that more efficiently is to use an alternation pattern built from the exclude list:

use warnings;
use strict;
use feature 'say';
use Path::Tiny;  # to read ("slurp") a file conveniently

my $excl_file = 'exclude.txt';

my $re_excl = join '|', split /\n/, path($excl_file)->slurp;
$re_excl = qr($re_excl);

while (<>) { 
    if ( m{^ $re_excl (?:/.)? $}x )  {   
        # say "Skip printing (so filter out): $_";
        next;
    }
    say;
}

This is used as program.pl dictionary-filename and it prints the filtered list.

Here I've assumed that what may follow the root-word to exclude is / followed by one character, (?:/.)? , since examples in the question use that and there is no precise statement on it. The pattern also assumes no spaces around the word.

Please adjust as needed for what may actually follow / . For example, it'd be (?:/.+)? for one or more characters, (?:/[np])? for a single character from a specific list ( n or p ), (?:/[^xy]+)? for any characters not in the given list, etc.

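The qr operator compiles the string into a proper regex pattern, which keeps the alternation grouped when it is interpolated into the larger pattern. This assumes the exclude words contain no regex metacharacters; otherwise each word would need to be passed through quotemeta first.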
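
For comparison, the same alternation idea can be sketched in shell with grep -E. This is a rough equivalent rather than a translation of the Perl program. It assumes the exclude words contain no regex metacharacters and that anything after the slash counts as a tag. The scratch directory is also an assumption.

```shell
# Scratch copies of the files from the question.
tmpdir=$(mktemp -d)
printf '%s\n' test testing access/p batch batch/n batches cross > "$tmpdir/en.txt"
printf '%s\n' test batch > "$tmpdir/exclude.txt"

# Join the exclude words with "|" to form the alternation: test|batch
alternation=$(paste -sd '|' "$tmpdir/exclude.txt")

# -x makes the pattern match the whole line, -v keeps the non-matching lines.
filtered=$(grep -Evx "($alternation)(/.+)?" "$tmpdir/en.txt")
printf '%s\n' "$filtered"
```

With the sample files this prints testing, access/p, batches and cross, one per line.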


A look-up can still be used if we first strip the non-word endings, then check the hash, and then put the endings back:

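use warnings;
use strict;
use feature 'say';
use Path::Tiny;  # to read a file conveniently

my $excl_file = 'exclude.txt';

# Look-up hash keyed by the words to exclude
my %lu = map { $_ => 1 } path($excl_file)->lines({ chomp => 1 });

while (<>) { 
    chomp;

    # [^\w-] protects hyphenated words (or just use \W)
    # Or: s{(/.+)$}{};  if "/" is the only possibility
    # Take $1 only when the substitution succeeded, so that a capture
    # left over from an earlier match is never reused
    my $ending = s/([^\w-].+)$// ? $1 : '';

    next if exists $lu{$_};

    say $_ . $ending;
}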

This will be far more efficient on large dictionaries and long lists of exclude words.

However, it is far more complex and probably fails some (unstated) requirements.

Using grep matching whole words:

grep -wvf exclude.txt en.txt

Explanation (from man grep)

  • -w, --word-regexp: Select only those lines containing matches that form whole words.
  • -v, --invert-match: Invert the sense of matching, to select non-matching lines.
  • -f FILE, --file=FILE: Obtain patterns from FILE, one per line.

Output

testing
access/p
batches
cross
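
One thing to keep in mind is that -w treats any character other than a letter, digit or underscore as a word boundary, not just the slash. The sketch below uses two hypothetical entries, "test-case" and "batch file", which are not in the question's dictionary, to show the effect. Whether this matters depends on what the real dictionary contains.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' test batch > "$tmpdir/exclude.txt"
# "test-case" and "batch file" are made-up entries for illustration.
printf '%s\n' test-case 'batch file' batch/n batches > "$tmpdir/extra.txt"

# Lines where an exclude word appears as a whole word are dropped.
kept=$(grep -wvf "$tmpdir/exclude.txt" "$tmpdir/extra.txt")
printf '%s\n' "$kept"
```

Only batches survives here, because "-" and the space delimit a word just as "/" does.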
