I am trying to exclude certain words from a dictionary file.
# cat en.txt
test
testing
access/p
batch
batch/n
batches
cross
# cat exclude.txt
test
batch
# grep -vf exclude.txt en.txt
access/p
cross
Words like "testing" and "batches" should be included in the results.
Expected result:
testing
access/p
batches
cross
The word "batch" may or may not be followed by a slash "/", and there can be one or more tags after the slash (n in this case). But "batches" is a different word and should not match "batch".
I would use GNU AWK for this task in the following way. Let the content of en.txt be
test
testing
access/p
batch
batch/n
batches
cross
and the content of exclude.txt be
test
batch
then
awk 'BEGIN{FS="/"}FNR==NR{arr[$1];next}!($1 in arr)' exclude.txt en.txt
gives output
testing
access/p
batches
cross
Explanation: I tell GNU AWK that / is the field separator (FS). While processing the first file (where the global record number equals the record number within the file, that is, FNR==NR), I simply use the 1st field as a key in the array arr and then go to the next line, so nothing else happens. For the 2nd file (and any following files), I select the lines whose 1st field is not (!) one of the keys of arr.
(tested in GNU Awk 5.0.1)
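The question says there can be one or more tags after the slash, so it may be worth checking that case too. Below is a minimal sketch that recreates the sample files in a temporary directory and adds one hypothetical entry, batch/np, which is not in the original data and is only there to illustrate a multi-tag line.

```shell
#!/bin/sh
# Work in a throwaway directory so no real files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Sample dictionary from the question, plus one hypothetical entry
# (batch/np) that carries more than one tag.
printf '%s\n' test testing access/p batch batch/n batch/np batches cross > en.txt
printf '%s\n' test batch > exclude.txt

# Same command as above: the key is the text before the first "/".
result=$(awk 'BEGIN{FS="/"} FNR==NR{arr[$1]; next} !($1 in arr)' exclude.txt en.txt)
printf '%s\n' "$result"
```

Because only the part before the first / is used as the key, the hypothetical batch/np line should be dropped together with batch and batch/n, while batches is kept.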
Since many words in a dictionary may have one of the excluded words as a root, we cannot conveniently † use a look-up hash (built from the exclude list) but have to check each line against all of them. One way to do that more efficiently is to use an alternation pattern built from the exclude list:
use warnings;
use strict;
use feature 'say';
use Path::Tiny; # to read ("slurp") a file conveniently
my $excl_file = 'exclude.txt';
my $re_excl = join '|', split /\n/, path($excl_file)->slurp;
$re_excl = qr($re_excl);
while (<>) {
    if ( m{^ $re_excl (?:/.)? $}x ) {
        # say "Skip printing (so filter out): $_";
        next;
    }
    say;
}
This is used as program.pl dictionary-filename and it prints the filtered list.
Here I've assumed that what may follow the root word to exclude is / followed by one character, (?:/.)?, since the examples in the question use that and there is no precise statement on it. The pattern also assumes no spaces around the word.
Please adjust as needed for what may actually follow the /. For example, it would be (?:/.+)? for at least one character, (?:/[np])? for any one character from a specific list (n or p), (?:/[^xy]+)? for any characters not in the given list, etc.
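The qr operator compiles the string into a proper regex pattern, which also keeps the alternation grouped when it is interpolated into the larger pattern.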
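For comparison, the same alternation idea can be sketched in plain shell with grep -E. This is only an illustration, not part of the Perl program: it assumes the exclude words contain no regex metacharacters, and it uses (/.+)? so that anything after the slash is accepted.

```shell
#!/bin/sh
# Work in a throwaway directory so no real files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Sample files from the question.
printf '%s\n' test testing access/p batch batch/n batches cross > en.txt
printf '%s\n' test batch > exclude.txt

# Join the exclude words with "|" to form the alternation: test|batch
alt=$(paste -sd '|' exclude.txt)

# Drop lines that are exactly an excluded word, optionally followed
# by "/" and at least one more character.
result=$(grep -Ev "^($alt)(/.+)?\$" en.txt)
printf '%s\n' "$result"
```

The explicit parentheses around the alternation play the role that qr plays in the Perl version: without them, the ^ and $ anchors would bind only to the first and last alternatives.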
† One can still first strip the non-word endings, then use a look-up, then put those endings back:
use feature 'say';
use Path::Tiny; # to read a file conveniently
my $excl_file = 'exclude.txt';
my %lu = map { $_ => 1 } path($excl_file)->lines({ chomp => 1 });
while (<>) {
    chomp;
    # [^\w-] protects hyphenated words (or just use \W)
    # Or: s{(/.+)$}{} if "/" is the only possibility
    # Keep the stripped ending in a variable rather than relying on $1 later
    my $tail = s/([^\w-].+)$// ? $1 : '';
    next if exists $lu{$_};
    say $_ . $tail;
}
This will be far more efficient on large dictionaries and long lists of exclude words.
However, it is far more complex and probably fails some (unstated) requirements.
Using grep matching whole words:
grep -wvf exclude.txt en.txt
Explanation (from man grep):
-w, --word-regexp
    Select only those lines containing matches that form whole words.
-v, --invert-match
    Invert the sense of matching, to select non-matching lines.
-f FILE, --file=FILE
    Obtain patterns from FILE, one per line.
Output:
testing
access/p
batches
cross
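One thing to keep in mind is that -w matches a whole word anywhere on the line, so a line such as mini-batch would also be dropped. Whether that matters depends on the dictionary. If it does, a stricter variant can be sketched with -x (match the whole line). The mini-batch entry and the patterns.txt file name below are hypothetical, and the exclude words are assumed to contain no regex metacharacters.

```shell
#!/bin/sh
# Work in a throwaway directory so no real files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Sample files from the question, plus a hypothetical hyphenated entry.
printf '%s\n' test testing access/p batch batch/n batches cross mini-batch > en.txt
printf '%s\n' test batch > exclude.txt

# -w treats "-" as a word boundary, so mini-batch is filtered out too.
w_result=$(grep -wvf exclude.txt en.txt)

# For each word emit two whole-line patterns: WORD and WORD/.*
sed 'p; s|$|/.*|' exclude.txt > patterns.txt

# -x requires the pattern to match the entire line.
result=$(grep -vxf patterns.txt en.txt)
printf '%s\n' "$result"
```

With -x the hypothetical mini-batch line is kept, because neither batch nor batch/.* matches the whole line.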