
more efficient way to extract lines from a file whose first column matches another file

I have two files, target and clean.

  • Target has 1,055,772 lines, each with 3,000 tab-separated columns. (size is 7.5 GB)
  • Clean is shorter, at 806,535 lines. It has only one column, which matches the format of the first column of Target. (size is 13 MB)

I want to extract the lines of target whose first column matches a line in clean.

I wrote a grep-based loop to do this, but it's painfully slow: every iteration re-scans the entire 7.5 GB file, once per line of clean. Speedups will be rewarded with upvotes and/or smileys.

clean="/path/to/clean"
target="/path/to/target"
oFile="/output/file"

head -1 "$target" > "$oFile"
while read -r snp; do
    echo "$snp"
    grep "$snp" "$target" >> "$oFile"
done < "$clean"

$ head $clean
1_111_A_G
1_123_T_A
1_456_A_G
1_7892_C_G

Edit: I wrote a simple Python script to do it.

clean_variants_file = "/scratch2/vyp-scratch2/cian/UCLex_August2014/clean_variants"
allChr_file = "/scratch2/vyp-scratch2/cian/UCLex_August2014/allChr_snpStats"
outfile = open("/scratch2/vyp-scratch2/cian/UCLex_August2014/results.tab", "w")

clean_variant_dict = {}
for line in open(clean_variants_file):
    clean_variant_dict[line.strip()] = 0

for line in open(allChr_file):
    ll = line.strip().split("\t")
    id_ = ll[0]
    if id_ in clean_variant_dict:
        outfile.write(line)

outfile.close()
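For comparison, the same single-pass idea fits in an awk one-liner: read clean first, remember each ID in an associative array, then stream target once and print the matching lines (a sketch; `clean`, `target`, and `output` stand in for the real paths):

```shell
# NR==FNR is true only while awk is reading the first file (clean):
# store each ID as an array key and skip to the next line.
# For the second file (target), print any line whose first
# tab-separated field is a stored key.
awk -F'\t' 'NR==FNR { keep[$1]; next } $1 in keep' clean target > output
```

As in the original script, the header line of target can be copied first with `head -1`.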

This Perl solution would use quite a lot of memory (because we load the entire file into memory), but would save you from looping twice. It uses a hash for duplicate checking, where each line is stored as a key. Note that this code is not thoroughly tested, but seems to work on a limited set of data.

use strict;
use warnings;

my ($clean, $target) = @ARGV;

open my $fh, "<", $clean or die "Cannot open file '$clean': $!";

my %seen;
while (<$fh>) {
    chomp;
    $seen{$_}++;
}

open $fh, "<", $target 
        or die "Cannot open file '$target': $!";    # reuse file handle

while (<$fh>) {
    my ($first) = /^([^\t]*)/;
    print if $seen{$first};
}

If your target file is proper tab-separated CSV data, you could use Text::CSV_XS, which is reportedly very fast.

Python solution:

with open('/path/to/clean', 'r') as fin:
    keys = set(fin.read().splitlines())

with open('/path/to/target', 'r') as fin, open('/output/file', 'w') as fout:
    for line in fin:
        if line[:line.index('\t')] in keys:
            fout.write(line)

Using a Perl one-liner:

perl -F'\t' -lane '
    BEGIN{ local @ARGV = pop; @s{<>} = () }
    print if exists $s{"$F[0]\n"}
  ' target clean

Switches:

  • -F : Alternate pattern for the -a switch to split on (a tab here)
  • -l : Enables line-ending processing (chomps input, restores "\n" on print)
  • -a : Splits each input line on the -F pattern and loads the fields into the array @F
  • -n : Wraps the code in a while (<>) { ... } loop over each line of input
  • -e : Tells perl to execute the code given on the command line

Or as a perl script:

use strict;
use warnings;

die "Usage: $0 target clean\n" if @ARGV != 2;

my %s = do {
    local @ARGV = pop;
    map {$_ => 1} (<>)
};

while (<>) {
    my ($f) = split /\t/;
    print if $s{"$f\n"}
}

For fun, I thought I would convert a solution or two into Perl6.

Note: These are probably going to be slower than the originals until Rakudo/NQP gains more optimizations, work which had only begun in earnest fairly recently at the time of posting.

  • First is TLP's Perl5 answer converted nearly one-to-one into Perl6.

    #! /usr/bin/env perl6
    # I have a link named perl6 aliased to Rakudo on MoarVM-jit
    use v6;

    multi sub MAIN ( Str $clean, Str $target ){ # same as the Perl5 version
        MAIN( :$clean, :$target ); # call the named version
    }

    multi sub MAIN ( Str :$clean!, Str :$target! ){ # using named arguments
        note "Processing clean file";
        my %seen := SetHash.new;
        for open( $clean, :r ).lines -> $line {
            next unless $line.chars; # skip empty lines
            %seen{$line}++;
        }

        note "Processing target file";
        for open( $target, :r ).lines -> $line {
            $line ~~ /^ $<first> = <-[\t]>+ /;
            say $line if %seen{$<first>.Str};
        }
    }

    I used MAIN subroutines so that you will get a Usage message if you don't give it the correct arguments.
    I also used a SetHash instead of a regular Hash to reduce memory use since we don't need to know how many we have found, only that they were found.

  • Next I tried to combine all of the lines in the clean file into one regex.

    This is similar to the sed and grep answer from Cyrus, except that instead of many regexes there is only one.

    I didn't want to change the subroutine that I had already written, so I added one that is selected by adding --single-regex or -s on the command line. (All of the examples are in the same file.)

    multi sub MAIN ( Str :$clean!, Str :$target!, Bool :single-regex(:s($))! ){
        note "Processing clean file";
        my $regex;
        {
            my @regex = open( $clean, :r ).lines.grep(*.chars);
            $regex = /^ [ | @regex ] /;
        } # throw away @regex

        note "Processing target file";
        for open( $target, :r ).lines -> $line {
            say $line if $line ~~ $regex;
        }
    }
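The sed and grep answer from Cyrus is not included in this excerpt; as a rough sketch of that style of approach (an assumption, not Cyrus's actual code; the `\t` in the replacement requires GNU sed):

```shell
# Turn each clean ID into an anchored regex: "ID" becomes "^ID<TAB>",
# so it can only match a complete first column. grep then applies
# all of the patterns in a single pass over target.
sed 's/^/^/; s/$/\t/' clean > patterns
grep -f patterns target > output
```

With hundreds of thousands of patterns this can still be slow and memory-hungry; the hash-based answers above scale better.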

I will say that this took me quite a bit longer to write than it would have taken in Perl5. Most of the time was spent searching online for idioms and looking over the Rakudo source files. I don't think it would take much effort to become better at Perl6 than at Perl5.
