简体   繁体   中英

Fastest way to find lines of a file from another larger file in Bash

I have two files, file1.txt and file2.txt . file1.txt has about 14K lines and file2.txt has about 2 billions. file1.txt has a single field f1 per line while file2.txt has 3 fields, f1 through f3 , delimited by | .

I want to find all lines from file2.txt where f1 of file1.txt matches f2 of file2.txt (or anywhere on the line if we don't want to spend extra time splitting the values of file2.txt ).

file1.txt (about 14K lines, not sorted ):

foo1
foo2
...
bar1
bar2
...

file2.txt (about 2 billion lines, not sorted ):

date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...

Output expected:

date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...

Here is what I have tried and it seems to be taking several hours to run:

fgrep -F -f file1.txt file2.txt > file.matched

I wonder if there is a better and faster way of doing this operation with the common Unix commands or with a small script.

A Perl solution. [See Note below.]

Use a hash for the first file. As you read the big file line-by-line, extract the field by regex (captures the first pattern between || ) or split (gets the second word) and print if it exists . They likely differ in speed a bit (time them). The defined check isn't needed in the regex while for split use // (defined-or) that short-circuits.

use warnings;
use strict;

# If 'prog smallfile bigfile' is the preferred use
die "Usage: $0 smallfile bigfile\n"  if @ARGV != 2;
my ($smallfile, $bigfile) = @ARGV;

open my $fh, '<', $smallfile or die "Can't open $smallfile: $!";    
my %word = map { chomp; $_ => 1 } <$fh>;

open    $fh, '<', $bigfile or die "Can't open $bigfile: $!";       
while (<$fh>) 
{
    exists $word{ (/\|([^|]+)/)[0] } && print;  

    # Or
    #exists $word{ (split /\|/)[1] // '' } && print;
}
close $fh;

Avoiding the if branch and using short-circuit is faster, but only very little. On billions of lines these tweaks add up but again not to too much. It may (or may not) be a tad bit faster to read the small file line by line, instead of in list context as above, but this should not be noticeable.

Update Writing to STDOUT saves two operations and I repeatedly time it to be a little faster than writing to a file. Such usage is also consistent with most UNIX tools so I changed to write to STDOUT . Next, the exists test is not needed and dropping it spares an operation. However, I consistently get a touch better runtimes with it , while it also conveys the purpose better. Altogether I am leaving it in. Thanks to ikegami for comments.

Note The commented out version is about 50% faster than the other, by my benchmark below. These are both given because they are different , one finding the first match and the other the second field. I am keeping it this way as a more generic choice, since the question is ambiguous on that.


Some comparisons (benchmark) [Updated for writing to STDOUT , see "Update" above]

There is an extensive analysis in the answer by HåkonHægland , timing one run of most solutions. Here is another take, benchmarking the two solutions above, the OP's own answer, and the posted fgrep one, expected to be fast and used in the question and in many answers.

I build test data in the following way. A handful of lines of the length roughly as shown are made with random words, for both files, so to match in the second field. Then I pad this "seed" for data samples with lines that won't match, so to mimic ratios between sizes and matches quoted by OP: for 14K lines in small file there are 1.3M lines in the big file, yielding 126K matches. Then these samples are written repeatedly to build full data files as OP's, shuffle -ed each time using List::Util .

All runs compared below produce 106_120 matches for the above file sizes ( diff -ed for a check), so the matching frequency is close enough. They are benchmarked by calling complete programs using my $res = timethese(60 ...) . The result of cmpthese($res) on v5.16 are

Rate regex  cfor split fgrep
regex 1.05/s    --  -23%  -35%  -44%
cfor  1.36/s   30%    --  -16%  -28%
split 1.62/s   54%   19%    --  -14%
fgrep 1.89/s   80%   39%   17%    --

The fact that the optimized C-program fgrep comes on top isn't surprising. The lag of " regex " behind " split " may be due to the overhead of starting the engine for little matches, many times. This may vary over Perl versions, given the evolving regex engine optimizations. I include the answer of @codeforester (" cfor ") since it was claimed to be fastest, and its 20% lag behind the very similar " split " is likely due to scattered small inefficiencies (see a comment below this answer).

This isn't shatteringly different, while there are sure variations across hardware and software and over data details. I ran this on different Perls and machines, and the notable difference is that in some cases fgrep was indeed an order of magnitude faster .

The OP's experience of very slow fgrep is surprising. Given their quoted run times, order of magnitude slower than the above, I'd guess that there's an old system to "blame."

Even though this is completely I/O based there are concurrency benefits from putting it on multiple cores and I'd expect a good speedup, up to a factor of a few.


Alas, the comment got deleted (?). In short: unneeded use of a scalar (costs), of an if branch, of defined , of printf instead of print (slow!). These matter for efficiency on 2 billion lines.

I have tried to do a comparison between some of the methods presented here.

First I created a Perl script to generate the input files file1.txt and file2.txt . In order to compare some of the solutions, I made sure that the the words from file1.txt only could appear in the second field in file2.txt . Also to be able to use the join solution presented by @GeorgeVasiliou, I sorted file1.txt and file2.txt . Currently I generated the input files based on only 75 random words ( taken from https://www.randomlists.com/random-words ). Only 5 of these 75 words was used in file1.txt the remaining 70 words was used to fill up the fields in file2.txt . It might be necessary to increase the number of words substantially to get realistic results ( according to the OP, the original file1.txt contained 14000 words). In the tests below I used a file2.txt with 1000000 ( 1 million ) lines. The script also generates the file regexp1.txt required by the grep solution of @BOC.

gen_input_files.pl :

#! /usr/bin/env perl
use feature qw(say);
use strict;
use warnings;

use Data::Printer;
use Getopt::Long;

GetOptions ("num_lines=i" => \my $nlines )
  or die("Error in command line arguments\n");

# Generated random words from site: https://www.randomlists.com/random-words
my $word_filename        = 'words.txt'; # 75 random words
my $num_match_words      = 5;
my $num_file2_lines      = $nlines || 1_000_000;
my $file2_words_per_line = 3;
my $file2_match_field_no = 2;
my $file1_filename       = 'file1.txt';
my $file2_filename       = 'file2.txt';
my $file1_regex_fn       = 'regexp1.txt';

say "generating $num_file2_lines lines..";
my ( $words1, $words2 ) = get_words( $word_filename, $num_match_words );

write_file1( $file1_filename, $words2 );
write_file2(
    $file2_filename, $words1, $words2, $num_file2_lines,
    $file2_words_per_line, $file2_match_field_no
);
write_BOC_regexp_file( $file1_regex_fn, $words2 );


sub write_BOC_regexp_file {
    my ( $fn, $words ) = @_;

    open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";
    print $fh '\\|' . (join "|", @$words) . '\\|';
    close $fh;
}

sub write_file2 {
    my ( $fn, $words1, $words2, $nlines, $words_per_line, $field_no ) = @_;

    my $nwords1 = scalar @$words1;
    my $nwords2 = scalar @$words2;
    my @lines;
    for (1..$nlines) {
        my @words_line;
        my $key;
        for (1..$words_per_line) {
            my $word;
            if ( $_ != $field_no ) {
                my $index = int (rand $nwords1);
                $word = @{ $words1 }[$index];
            }
            else {
                my $index = int (rand($nwords1 + $nwords2) );
                if ( $index < $nwords2 ) {
                    $word = @{ $words2 }[$index];
                }
                else {
                    $word =  @{ $words1 }[$index - $nwords2];
                }
                $key = $word;
            }
            push @words_line, $word;
        }
        push @lines, [$key, (join "|", @words_line)];
    }
    @lines = map { $_->[1] } sort { $a->[0] cmp $b->[0] } @lines; 
    open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";
    print $fh (join "\n", @lines);
    close $fh;
}

sub write_file1 {
    my ( $fn, $words ) = @_;

    open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";
    print $fh (join "\n", sort @$words);
    close $fh;
}

sub get_words {
    my ( $fn, $N ) = @_;

    open( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
    my @words = map {chomp $_; $_} <$fh>;
    close $fh;

    my @words1 = @words[$N..$#words];
    my @words2 = @words[0..($N - 1)];
    return ( \@words1, \@words2 );
}

Next, I created a sub folder solutions with all the test cases:

$ tree solutions/
solutions/
├── BOC1
│   ├── out.txt
│   └── run.sh
├── BOC2
│   ├── out.txt
│   └── run.sh
├── codeforester
│   ├── out.txt
│   ├── run.pl
│   └── run.sh
[...]

Here the files out.txt is the output from the greps for each solution. The scripts run.sh runs the solution for the given test case.

Notes on the different solutions

  • BOC1 : First solution presented by @BOC

    grep -E -f regexp1.txt file2.txt
  • BOC2 : Second solution suggested by @BOC:

     LC_ALL=C grep -E -f regexp1.txt file2.txt
  • codeforester : Accepted Perl solution by @codeforester ( see source )

  • codeforester_orig : Original solution presented by @codeforested:

     fgrep -f file1.txt file2.txt
  • dawg : Python solution using dictionary and split line proposed by @dawg ( see source )

  • gregory1 : solution using Gnu Parallel suggested by @gregory

    parallel -k --pipepart -a file2.txt --block "$block_size" fgrep -F -f file1.txt

    See note below regarding how to choose $block_size .

  • hakon1 : Perl solution provided by @HåkonHægland (see source ). This solution requires compilation of the c-extension the first time the code is run. It does not require recompilation when file1.txt or file2.txt changes. Note: The time used to compile the c-extension at the initial run is not included in the run times presented below.

  • ikegami : Solution using assembled regexp and using grep -P as given by @ikegami. Note: The assembled regexp was written to a separate file regexp_ikegami.txt , so the runtime of generating the regexp is not included in the comparison below. This is the code used:

     regexp=$(< "regexp_ikegami.txt") grep -P "$regexp" file2.txt
  • inian1 : First solution by @Inian using match()

    awk 'FNR==NR{ hash[$1]; next } { for (i in hash) if (match($0,i)) {print; break} }' file1.txt FS='|' file2.txt
  • inian2 : Second solution by @Inian using index()

    awk 'FNR==NR{ hash[$1]; next } { for (i in hash) if (index($0,i)) {print; break} }' file1.txt FS='|' file2.txt
  • inian3 : Third solution by @Inian checking only $2 field:

     awk 'FNR==NR{ hash[$1]; next } $2 in hash' file1.txt FS='|' file2.txt
  • inian4 : 4th soultion by @Inian ( basically the same as codeforester_orig with LC_ALL ) :

     LC_ALL=C fgrep -f file1.txt file2.txt
  • inian5 : 5th solution by @Inian (same as inian1 but with LC_ALL ):

     LC_ALL=C awk 'FNR==NR{ hash[$1]; next } { for (i in hash) if (match($0,i)) {print; break} }' file1.txt FS='|' file2.txt
  • inian6 : Same as inian3 but with LC_ALL=C . Thanks to @GeorgeVasiliou for suggestion.

  • jjoao : Compiled flex-generated C code as proposed by @JJoao (see source ). Note: Recompilation of the exectuable must be done each time file1.txt changes. The time used to compile the executable is not included in the run times presented below.

  • oliv : Python script provided by @oliv ( see source )

  • Vasiliou : Using join as suggested by @GeorgeVasiliou:

     join --nocheck-order -11 -22 -t'|' -o 2.1 2.2 2.3 file1.txt file2.txt
  • Vasiliou2 : Same as Vasiliou but with LC_ALL=C .

  • zdim : Using Perl script provided by @zdim ( see source ). Note: This uses the regexp search version ( instead of split line solution ).

  • zdim2 : Same as zdim except that it uses the split function instead of regexp search for the field in file2.txt .

Notes

  1. I experimented a little bit with Gnu parallel (see gregory1 solution above) to determine the optimal block size for my CPU. I have 4 cores, and and currently it seems that the optimal choice is to devide the file ( file2.txt ) into 4 equal sized chunks, and run a single job on each of the 4 processors. More testing might be needed here. So for the first test case where file2.txt is 20M, I set $block_size to 5M ( see gregory1 solution above), whereas for the more realistic case presented below where file2.txt is 268M, a $block_size of 67M was used.

  2. The solutions BOC1 , BOC2 , codeforester_orig , inian1 , inian4 , inian5 , and gregory1 all used loose matching. Meaning that the words from file1.txt did not have to match exactly in field #2 of file2.txt . A match anywhere on the line was accepted. Since this behavior made it more difficult to compare them with the other methods, some modified methods were also introduced. The first two methods called BOC1B and BOC2B used a modified regexp1.txt file. The lines in the original regexp1.txt where on the form \\|foo1|foo2|...|fooN\\| which would match the words at any field boundary. The modified file, regexp1b.txt , anchored the match to field #2 exclusively using the form ^[^|]*\\|foo1|foo2|...|fooN\\| instead.

    Then the rest of the modified methods codeforester_origB , inian1B , inian4B , inian5B , and gregory1B used a modified file1.txt . Instead of a literal word per line, the modified file file1b.txt used one regex per line on the form:

     ^[^|]*\\|word1\\| ^[^|]*\\|word2\\| ^[^|]*\\|word3\\| [...]

    and in addition, fgrep -f was replaced by grep -E -f for these methods.

Running the tests

Here is the script used for running all the tests. It uses the Bash time command to record the time spent for each script. Note that the time command returns three different times call real , user , and sys . First I used user + sys , but realized that this was incorrect when using Gnu parallel command, so the time reported below is now the real part returned by time . See this question for more information about the different times returned by time .

The first test is run with file1.txt containing 5 lines, and file2.txt containing 1000000 lines. Here is the first 52 lines of the run_all.pl script, the rest of the script is available here .

run_all.pl

#! /usr/bin/env perl

use feature qw(say);
use strict;
use warnings;

use Cwd;
use Getopt::Long;
use Data::Printer;
use FGB::Common;
use List::Util qw(max shuffle);
use Number::Bytes::Human qw(format_bytes);
use Sys::Info;

GetOptions (
    "verbose"       => \my $verbose,
    "check"         => \my $check,
    "single-case=s" => \my $case,
    "expected=i"    => \my $expected_no_lines,
) or die("Error in command line arguments\n");

my $test_dir    = 'solutions';
my $output_file = 'out.txt';
my $wc_expected = $expected_no_lines; # expected number of output lines

my $tests       = get_test_names( $test_dir, $case );

my $file2_size  = get_file2_size();
my $num_cpus    = Sys::Info->new()->device( CPU => () )->count;

chdir $test_dir;
my $cmd = 'run.sh';
my @times;
for my $case (@$tests) {
    my $savedir = getcwd();
    chdir $case;
    say "Running '$case'..";
    my $arg = get_cmd_args( $case, $file2_size, $num_cpus );
    my $output = `bash -c "{ time -p $cmd $arg; } 2>&1"`;
    my ($user, $sys, $real ) = get_run_times( $output );
    print_timings( $user, $sys, $real ) if $verbose;
    check_output_is_ok( $output_file, $wc_expected, $verbose, $check );
    print "\n" if $verbose;
    push @times, $real;
    #push @times, $user + $sys; # this is wrong when using Gnu parallel
    chdir $savedir;
}

say "Done.\n";

print_summary( $tests, \@times );

Results

Here is the output from running the tests:

$  run_all.pl --verbose
Running 'inian3'..
..finished in 0.45 seconds ( user: 0.44, sys: 0.00 )
..no of output lines: 66711

Running 'inian2'..
..finished in 0.73 seconds ( user: 0.73, sys: 0.00 )
..no of output lines: 66711

Running 'Vasiliou'..
..finished in 0.09 seconds ( user: 0.08, sys: 0.00 )
..no of output lines: 66711

Running 'codeforester_orig'..
..finished in 0.05 seconds ( user: 0.05, sys: 0.00 )
..no of output lines: 66711

Running 'codeforester'..
..finished in 0.45 seconds ( user: 0.44, sys: 0.01 )
..no of output lines: 66711

[...]

Summary

[Results obtained by @Vasiliou are shown in the middle column.]

                               |Vasiliou
My Benchmark                   |Results  |   Details
-------------------------------|---------|----------------------
inian4             : 0.04s     |0.22s    | LC_ALL fgrep -f [loose] 
codeforester_orig  : 0.05s     |         | fgrep -f [loose]
Vasiliou2          : 0.06s     |0.16s    | [LC_ALL join [requires sorted files]]
BOC1               : 0.06s     |         | grep -E [loose] 
BOC2               : 0.07s     |15s      | LC_ALL grep -E [loose] 
BOC2B              : 0.07s     |         | LC_ALL grep -E [strict] 
inian4B            : 0.08s     |         | LC_ALL grep -E -f [strict] 
Vasiliou           : 0.08s     |0.23s    | [join [requires sorted files]] 
gregory1B          : 0.08s     |         | [parallel + grep -E -f [strict]] 
ikegami            : 0.1s      |         | grep -P 
gregory1           : 0.11s     |0.5s     | [parallel + fgrep -f [loose]] 
hakon1             : 0.14s     |         | [perl + c]
BOC1B              : 0.14s     |         | grep -E [strict] 
jjoao              : 0.21s     |         | [compiled flex generated c code] 
inian6             : 0.26s     |0.7s     | [LC_ALL awk + split+dict] 
codeforester_origB : 0.28s     |         | grep -E -f [strict] 
dawg               : 0.35s     |         | [python + split+dict] 
inian3             : 0.44s     |1.1s     | [awk + split+dict] 
zdim2              : 0.4s      |         | [perl + split+dict] 
codeforester       : 0.45s     |         | [perl + split+dict] 
oliv               : 0.5s      |         | [python + compiled regex + re.search()] 
zdim               : 0.61s     |         | [perl + regexp+dict] 
inian2             : 0.73s     |1.7s     | [awk + index($0,i)] 
inian5             : 18.12s    |         | [LC_ALL awk + match($0,i) [loose]] 
inian1             : 19.46s    |         | [awk + match($0,i) [loose]] 
inian5B            : 42.27s    |         | [LC_ALL awk + match($0,i) [strict]] 
inian1B            : 85.67s    |         | [awk + match($0,i) [strict]] 

Vasiliou Results : 2 X CPU Intel 2 Duo T6570 @ 2.10GHz - 2Gb RAM-Debian Testing 64bit- kernel 4.9.0.1 - no cpu freq scaling.

A more realistic test case

I then created a more realistic case with file1.txt having 100 words and file2.txt having 10 million lines (268Mb file size). I extracted 1000 random words from the dictionary at /usr/share/dict/american-english using shuf -n1000 /usr/share/dict/american-english > words.txt then extracted 100 of these words into file1.txt and then constructed file2.txt the same way as described above for the first test case. Note that the dictionary file was UTF-8 encoded, and I stripped away all non-ASCII characters from the words.txt .

Then I run the test without the three slowest methods from the previous case. Ie inian1 , inian2 , and inian5 were left out. Here are the new results:

gregory1           : 0.86s     | [parallel + fgrep -f [loose]]
Vasiliou2          : 0.94s     | [LC_ALL join [requires sorted files]]
inian4B            : 1.12s     | LC_ALL grep -E -f [strict] 
BOC2B              : 1.13s     | LC_ALL grep -E [strict] 
BOC2               : 1.15s     | LC_ALL grep -E [loose] 
BOC1               : 1.18s     | grep -E [loose] 
ikegami            : 1.33s     | grep -P 
Vasiliou           : 1.37s     | [join [requires sorted files]]
hakon1             : 1.44s     | [perl + c]
inian4             : 2.18s     | LC_ALL fgrep -f [loose] 
codeforester_orig  : 2.2s      | fgrep -f [loose] 
inian6             : 2.82s     | [LC_ALL awk + split+dict] 
jjoao              : 3.09s     | [compiled flex generated c code] 
dawg               : 3.54s     | [python + split+dict] 
zdim2              : 4.21s     | [perl + split+dict]
codeforester       : 4.67s     | [perl + split+dict] 
inian3             : 5.52s     | [awk + split+dict] 
zdim               : 6.55s     | [perl + regexp+dict] 
gregory1B          : 45.36s    | [parallel + grep -E -f [strict]] 
oliv               : 60.35s    | [python + compiled regex + re.search()] 
BOC1B              : 74.71s    | grep -E [strict] 
codeforester_origB : 75.52s    | grep -E -f [strict] 

Note

The grep based solutions were looking for a match on the whole line, so in this case they contained some false matches: the methods codeforester_orig , BOC1 , BOC2 , gregory1 , inian4 , and oliv extracted 1,087,609 lines out of 10,000,000 lines, whereas the other methods extracted the correct 997,993 lines from file2.txt .

Notes

  • I tested this on my Ubuntu 16.10 laptop (Intel Core i7-7500U CPU @ 2.70GHz)

  • The whole benchmark study is available here .

Did you try Awk that could speed up things a bit:

awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt

(or) using index() function in Awk as suggested by comments from Benjamin W. , below

awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (index($0,i)) {print; break}}' file1.txt FS='|' file2.txt

(or) a more direct regex match as suggested by Ed Morton in comments,

awk 'FNR==NR{hash[$1]; next}{for (i in hash) if ($0~i) {print; break}}' file1.txt FS='|' file2.txt

is all you need. I'm guessing this will be faster but not exactly sure on files with million+ entries. Here the problem is with the possibility match in anywhere along the line. Had the same been in any particular column (eg say $2 alone), a faster approach could be

awk 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt

Also you could speed-things up by playing with the locale set in your system. Paraphrasing from this wonderful Stéphane Chazelas's answer on the subject, you could speed up things pretty quickly by setting passing the locale LC_ALL=C to the command locally being run.

On any GNU based system, the defaults for the locale

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

With one variable LC_ALL , you can set all LC_ type variables at once to a specified locale

$ LC_ALL=C locale
LANG=en_US.UTF-8
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"       
LC_ALL=C

So what does this impact?

Simply put, when using the locale C it will default to the server's base Unix/Linux language of ASCII . Basically when you grep something, by default your locale is going to be internationalized and set to UTF-8 , which can represent every character in the Unicode character set to help display any of the world's writing systems, currently over more than 110,000 unique characters, whereas with ASCII each character is encoded in a single byte sequence and its character set comprises of no longer than 128 unique characters.

So it translates to this, when using grep on a file encoded in UTF-8 character-set, it needs to match each character with any of the hundred thousand unique characters, but just 128 in ASCII , so use your fgrep as

LC_ALL=C fgrep -F -f file1.txt file2.txt

Also, the same can be adapted to the Awk , since it uses a regex match with the match($0,i) call, setting the C locale could speed up the string match.

LC_ALL=C awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt

Assumptions: 1. You want to run this search on just your local workstation. 2. Your have multiple cores/cpus to take advantage of a parallel search.

parallel --pipepart -a file2.txt --block 10M fgrep -F -f file1.txt

Some further tweaks depending on the context: A. Disable NLS with LANG=C (this is mentioned already in another answer) B. Set a max number of matches with the -m flag.

Note: I'm guessing that file2 is ~4GB and the 10M block size is ok, but you may need to optimize the block size to get the fastest run.

A small piece of Perl code solved the problem. This is the approach taken:

  • store the lines of file1.txt in a hash
  • read file2.txt line by line, parse and extract the second field
  • check if the extracted field is in the hash; if so, print the line

Here is the code:

#!/usr/bin/perl -w

use strict;
if (scalar(@ARGV) != 2) {
  printf STDERR "Usage: fgrep.pl smallfile bigfile\n";
  exit(2);
}

my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
my ($small_fp, $big_fp, %small_hash, $field);

open($small_fp, "<", $small_file) || die "Can't open $small_file: " . $!;
open($big_fp, "<", $big_file)     || die "Can't open $big_file: "   . $!;

# store contents of small file in a hash
while (<$small_fp>) {
  chomp;
  $small_hash{$_} = undef;
}
close($small_fp);

# loop through big file and find matches
while (<$big_fp>) {
  # no need for chomp
  $field = (split(/\|/, $_))[1];
  if (defined($field) && exists($small_hash{$field})) {
    printf("%s", $_);
  }
}

close($big_fp);
exit(0);

I ran the above script with 14K lines in file1.txt and 1.3M lines in file2.txt. It finished in about 13 seconds, producing 126K matches. Here is the time output for the same:

real    0m11.694s
user    0m11.507s
sys 0m0.174s

I ran @Inian's awk code:

awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt

It was way slower than the Perl solution, since it is looping 14K times for each line in file2.txt - which is really expensive. It aborted after processing 592K records of file2.txt and producing 40K matched lines. Here is how long it took:

awk: illegal primary in regular expression 24/Nov/2016||592989 at 592989
 input record number 675280, file file2.txt
 source line number 1

real    55m5.539s
user    54m53.080s
sys 0m5.095s

Using @Inian's other awk solution, which eliminates the looping issue:

time awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk1.out

real    0m39.966s
user    0m37.916s
sys 0m0.743s

time LC_ALL=C awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk.out

real    0m41.057s
user    0m38.475s
sys 0m0.904s

awk is very impressive here, given that we didn't have to write an entire program to do it.

I ran @oliv's Python code as well. It took about 15 hours to complete the job, and looked like it produced the right results. Building a huge regex isn't as efficient as using a hash lookup. Here the time output:

real    895m14.862s
user    806m59.219s
sys 1m12.147s

I tried to follow the suggestion to use parallel . However, it failed with fgrep: memory exhausted error, even with very small block sizes.


What surprised me was that fgrep was totally unsuitable for this. I aborted it after 22 hours and it produced about 100K matches. I wish fgrep had an option to force the content of -f file to be kept in a hash, just like what the Perl code did.

I didn't check join approach - I didn't want the additional overhead of sorting the files. Also, given fgrep 's poor performance, I don't believe join would have done better than the Perl code.

Thanks everyone for your attention and responses.

This Perl script ( a ) generates a regex pattern:

#!/usr/bin/perl

use strict;
use warnings;

use Regexp::Assemble qw( );

chomp( my @ids = <> );
my $ra = Regexp::Assemble->new();
$ra->add(quotemeta($_)) for @ids;
print("^[^|]*\\|(?:" . (re::regexp_pattern($ra->re()))[0] . ")\\|");

Here's how it can be used:

$ LC_ALL=C grep -P "$( a file1.txt )" file2.txt
date1|foo1|number1
date2|foo2|number2
date1|bar1|number1
date2|bar2|number2

Note the the script uses Regexp::Assemble, so you may need to install it.

sudo su
cpan Regexp::Assemble

Notes:

  • Unlike the solutions dubbed BOC1, BOC2, codeforester_orig, gregory1, inian2, inian4 and oliv, my solution correctly handles

    file1.txt foo1 file2.txt date1|foo12|number5
  • Mine should be better than the similar solution by @BOC because the pattern is optimized to reduce backtracking. (Mine also works if there are more than three fields in file2.txt , whereas the linked solution can fail.)

  • I don't know how it compares to the split+dictionary solutions.

Here is Perl solution that usesInline::C to speed up the search for matching fields in the large file:

use strict;
use warnings;
use Inline C => './search.c';

my $smallfile = 'file1.txt';
my $bigfile   = 'file2.txt';

open my $fh, '<', $smallfile or die "Can't open $smallfile: $!";
my %word = map { chomp; $_ => 1 } <$fh>;
search( $bigfile, \%word );

The search() sub routine is implemented in pure C using perlapi to look up keys in the small file dictionary %words :

search.c :

#include <stdio.h>
#include <sys/stat.h> 
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>


#define BLOCK_SIZE 8192       /* how much to read from file each time */
static char read_buf[BLOCK_SIZE + 1];

/*  reads a block from file, returns -1 on error, 0 on EOF, 
     else returns chars read, pointer to buf, and pointer to end of buf  */
size_t read_block( int fd, char **ret_buf, char **end_buf ) {
    int ret;
    char *buf = read_buf;
    size_t len = BLOCK_SIZE;
    while (len != 0 && (ret = read(fd, buf, len)) != 0) {
        if (ret == -1) {
            if (errno == EINTR)
                continue;
            perror( "read" );
            return ret;
        }
        len -= ret;
        buf += ret;
    }
    *end_buf = buf;
    *ret_buf = read_buf;
    return (size_t) (*end_buf - *ret_buf);
}

/* updates the line buffer with the char pointed to by cur,
   also updates cur
    */
int update_line_buffer( char **cur, char **line, size_t *llen, size_t max_line_len ) {
    if ( *llen > max_line_len ) {
        fprintf( stderr, "Too long line. Maximimum allowed line length is %ld\n",
                 max_line_len );
        return 0;
    }
    **line = **cur;
    (*line)++;
    (*llen)++;
    (*cur)++; 
    return 1;
}


/*    search for first pipe on a line (or next line if this is empty),
    assume line ptr points to beginning of line buffer.
  return 1 on success
  Return 0 if pipe could not be found for some reason, or if 
    line buffer length was exceeded  */
int search_field_start(
    int fd, char **cur, char **end_buf, char **line, size_t *llen, size_t max_line_len
) {
    char *line_start = *line;

    while (1) {
        if ( *cur >= *end_buf ) {
            size_t res = read_block( fd, cur, end_buf );        
            if (res <= 0) return 0;
        }
        if ( **cur == '|' ) break;
        /* Currently we just ignore malformed lines ( lines that do not have a pipe,
           and empty lines in the input */
        if ( **cur == '\n' ) {
            *line = line_start;
            *llen = 0;
            (*cur)++;
        }
        else {
            if (! update_line_buffer( cur, line, llen, max_line_len ) ) return 0;
        }
    }
    return 1;
}

/* assume cur points at starting pipe of field
  return -1 on read error, 
  return 0 if field len was too large for buffer or line buffer length exceed,
  else return 1
  and field, and  length of field
 */
int copy_field(
    int fd, char **cur, char **end_buf, char *field,
    size_t *flen, char **line, size_t *llen, size_t max_field_len, size_t max_line_len
) {
    *flen = 0;
    while( 1 ) {
        if (! update_line_buffer( cur, line, llen, max_line_len ) ) return 0;
        if ( *cur >= *end_buf ) {
            size_t res = read_block( fd, cur, end_buf );        
            if (res <= 0) return -1;
        }
        if ( **cur == '|' ) break;
        if ( *flen > max_field_len ) {
            printf( "Field width too large. Maximum allowed field width: %ld\n",
                    max_field_len );
            return 0;
        }
        *field++ = **cur;
        (*flen)++;
    }
    /* It is really not necessary to null-terminate the field 
       since we return length of field and also field could 
       contain internal null characters as well
    */
    //*field = '\0';
    return 1;
}

/* search to beginning of next line,
  return 0 on error,
  else return 1 */
int search_eol(
    int fd, char **cur, char **end_buf, char **line, size_t *llen, size_t max_line_len)
{
    while (1) {
        if ( *cur >= *end_buf ) {
            size_t res = read_block( fd, cur, end_buf );        
            if (res <= 0) return 0;
        }
        if ( !update_line_buffer( cur, line, llen, max_line_len ) ) return 0;
        if ( *(*cur-1) == '\n' ) {
            break;
        }
    }
    //**line = '\0'; // not necessary
    return 1;
}

#define MAX_FIELD_LEN 80  /* max number of characters allowed in a field  */
#define MAX_LINE_LEN 80   /* max number of characters allowed on a line */

/* 
   Get next field ( i.e. field #2 on a line). Fields are
   separated by pipes '|' in the input file.
   Also get the line of the field.
   Return 0 on error,
   on success: Move internal pointer to beginning of next line
     return 1 and the field.
 */
size_t get_field_and_line_fast(
    int fd, char *field, size_t *flen, char *line, size_t *llen
) {
    static char *cur = NULL;
    static char *end_buf = NULL;

    size_t res;
    if (cur == NULL) {
        res = read_block( fd, &cur, &end_buf );        
        if ( res <= 0 ) return 0;
    }
    *llen = 0;
    if ( !search_field_start( fd, &cur, &end_buf, &line, llen, MAX_LINE_LEN )) return 0;
    if ( (res = copy_field(
        fd, &cur, &end_buf, field, flen, &line, llen, MAX_FIELD_LEN, MAX_LINE_LEN
    ) ) <= 0)
        return 0;
    if ( !search_eol( fd, &cur, &end_buf, &line, llen, MAX_LINE_LEN ) ) return 0;
    return 1;
}

void search( char *filename, SV *href) 
{
    if( !SvROK( href ) || ( SvTYPE( SvRV( href ) ) != SVt_PVHV ) ) {
        croak( "Not a hash reference" );
    }

    int fd = open (filename, O_RDONLY);
    if (fd == -1) {
        croak( "Could not open file '%s'", filename );
    }
    char field[MAX_FIELD_LEN+1];
    char line[MAX_LINE_LEN+1];
    size_t flen, llen;
    HV *hash = (HV *)SvRV( href );
    while ( get_field_and_line_fast( fd, field, &flen, line, &llen ) ) {
        if( hv_exists( hash, field, flen ) )
            fwrite( line, sizeof(char), llen, stdout);
    }
    if (close(fd) == -1)
        croak( "Close failed" );

}

Tests indicate that it is approximately 3 times faster than the fastest pure Perl solution (see method zdim2 in my other answer ) presented here.

Here is a Python solution using sets -- roughly equivalent to a Perl key only hash or awk array in concept.

#!/usr/bin/python

import sys 

with open(sys.argv[1]) as f:
    tgt={e.rstrip() for e in f}

with open(sys.argv[2]) as f:
    for line in f:
        cells=line.split("|")
        if cells[1] in tgt:
            print line.rstrip()

When I run this on files of similar size, it runs in about 8 seconds.

Same speed as:

$ awk 'FNR==NR{arr[$1]; next} $2 in arr{print $0}' FS="|" /tmp/f1 /tmp/f2 

Both the Python and awk solution here are full string match only; not a partial regex style match.

Since the awk solution is fast and POSIX compliant, that is the better answer.

Can you give a try to join ? Files must be sorted though...

$ cat d.txt
bar1
bar2
foo1
foo2

$ cat e.txt
date1|bar1|number1
date2|bar2|number2
date3|bar3|number3
date1|foo1|number1
date2|foo2|number2
date3|foo3|number3

$ join --nocheck-order -11 -22 -t'|' -o 2.1 2.2 2.3 d.txt e.txt
date1|bar1|number1
date2|bar2|number2
date1|foo1|number1
date2|foo2|number2

Small Update:
By using LC_ALL=C in front of join, things are really speed up as can be seen in the benchmark of Håkon Hægland

PS1: I have my doubts if join can be faster than grep -f ...

Though this thread is over, but all grep-alike methods between two files are gathered in this post, why not to add this awk alternative, similar (or even improved) to the bounty winning Inian's awk solution:

awk 'NR==FNR{a[$0]=1;next}a[$2]' patterns.txt FS="|" datafile.txt >matches.txt # For matches restricted on Field2 of datafile

This is equivalent to Inian awk $2 in hash solution but it could be even faster due to the fact that we don't ask awk to check if the whole hash array contains $2 of file2 - we just check if a[$2] has a value or not.

While reading the first patterns file appart from creating the hash array we assign also a value.

If $2 of datafile had been found before in patterns file, then a[$2] would have a value and thus will be printed because is not null.

if a[$2] of datafile returns no value(null) this is translated to false => no printing.

Extension to match any of the three fields of datafile:

awk 'NR==FNR{a[$0]=1;next}(a[$1] || a[$2] || a[$3])' patterns.txt FS="|" datafile.txt >matches.txt. #Printed if any of the three fields of datafile match pattern.

In both cases, applying LC_ALL=C in front of awk, seems to speed things up.

PS1: Offcourse this solution has also the pitfalls of all awk solutions. Is not a pattern matching. Is a direct / fixed matching between of the two files, like most of the solutions inhere.

PS2: In my poor machine benchmark using the small benchmark files of Håkon Hægland , i get about 20% better performance comparing to the awk 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt awk 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt

A possible way is to use python :

$ cat test.py
import sys,re

with open(sys.argv[1], "r") as f1:
    patterns = f1.read().splitlines() # read pattern from file1 without the trailing newline

m = re.compile("|".join(patterns))    # create the regex

with open(sys.argv[2], "r") as f2:
    for line in f2: 
        if m.search(line) : 
            print line,               # print line from file2 if this one matches the regex

and use it like this:

python test.py file1.txt file2.txt

You can also use Perl for this:

Please note that this will hog memory and your machine/server better has some.

Sample Data:

%_STATION@gaurav * /root/ga/pl> head file1.txt file2.txt
==> file1.txt <==
foo1
foo2
...
bar1
bar2
...

==> file2.txt <==
date1|foo1|number1
date2|foo2|number2
date3|foo3|number3
...
date1|bar1|number1
date2|bar2|number2
date3|bar3|number3
%_STATION@gaurav * /root/ga/study/pl>

Script Output: Script will produce final output in a file named output_comp .

%_STATION@gaurav * /root/ga/pl> ./comp.pl  file1.txt file2.txt ; cat output_comp
date1|bar1|number1
date2|bar2|number2
date2|foo2|number2
date1|foo1|number1
%_STATION@gaurav * /root/ga/pl>

Script:

%_STATION@gaurav * /root/ga/pl> cat comp.pl
#!/usr/bin/perl

use strict ;
use warnings ;
use Data::Dumper ;

my ($file1,$file2) = @ARGV ;
my $output = "output_comp" ;
my %hash ;    # This will store main comparison data.
my %tmp ;     # This will store already selected results, to be skipped.
(scalar @ARGV != 2 ? (print "Need 2 files!\n") : ()) ? exit 1 : () ;

# Read all files at once and use their name as the key.
for (@ARGV) {
  open FH, "<$_" or die "Cannot open $_\n" ;
  while  (my $line = <FH>) {chomp $line ;$hash{$_}{$line} = "$line"}
  close FH ;
}

# Now we churn through the data and compare to generate
# the sorted output in the output file.
open FH, ">>$output" or die "Cannot open outfile!\n" ;
foreach my $k1 (keys %{$hash{$file1}}){
  foreach my $k2 (keys %{$hash{$file2}}){
    if ($k1 =~ m/^.+?$k2.+?$/) {
      if (!defined $tmp{"$hash{$file2}{$k2}"}) {
        print FH "$hash{$file2}{$k2}\n" ;
        $tmp{"$hash{$file2}{$k2}"} = 1 ;
      }
    }
  }
}
close FH  ;
%_STATION@gaurav * /root/ga/pl>

Thanks.

IMHO, grep is a good tool highly optimised for huge file2.txt but maybe not for so many patterns to search. I suggest to combine all the strings of file1.txt into a single huge regexp like \\|bar1|bar2|foo1|foo2\\|

echo  '\|'$(paste -s -d '|' file1.txt)'\|' > regexp1.txt

grep -E -f regexp1.txt file2.txt > file.matched

And of course LANG=C may help. Please give feedback or send your files so I can test myself.

I would use SQLite3 :) Maybe in-memory database or whatever. Import the files and use SQL query.

Using flex :

1: build the flex processor:

$ awk 'NR==1{ printf "%%%%\n\n.*\\|(%s",$0 } 
            { printf "|%s",$0 } 
       END  { print ")\\|.*\\n ECHO;\n.*\\n ;\n%%\n" }' file1.txt > a.fl

2: compile it

$ flex -Ca -F a.fl ; cc -O lex.yy.c -lfl

3: and run

$ a.out < file2.txt  > out

Compiling (cc ...) is a slow process; this approach will pay just for cases of stable file1.txt

(In my machine) The times taken to run a search "100 in 10_000_000" test in this approach is 3 times faster than LC_ALL=C fgrep...

setting language etc helps a little, perhaps.

otherwise I can not think of a magic solution to escape from your basic issue: the data is not structured, so you will have a search that comes down to the number of lines in file1 multiplied with the number of lines in file2.

put the billion lines in a database, and index it in a smart way, is the only speed up I can think of. that index would have to be very smart, though......

SImple solution is: have enough memory to fit everything in to. otherwise nothing much more you can do about this....

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM