
Bash: How to keep lines in a file that have fields that match lines in another file?

I have two big files with a lot of text, and what I have to do is keep all lines in file A that have a field that matches a field in file B.

file A is something like:

Name (tab)  #  (tab)  #  (tab)  KEYFIELD  (tab)  Other fields

For file B, I managed to use cut and sed and other things to basically get it down to one field that is a list.

So the goal is to keep all lines in file A whose 4th field (labeled KEYFIELD) matches one of the lines in file B. (It does NOT have to be an exact match, so if file B had Blah and file A said Blah_blah, that would be OK.)

I tried to do:

grep -f fileBcutdown fileA > outputfile

EDIT: Ok I give up. I just force killed it.

Is there a better way to do this? File A is 13.7MB and file B after cutting it down is 32.6MB for anyone that cares.

EDIT: This is an example line in file A:

chr21 33025905 33031813 ENST00000449339.1 0 - 33031813 33031813 0 3 1835,294,104, 0,4341,5804,

example line from file B cut down:

ENST00000111111

You're hitting the limit of using the basic shell tools. Assuming about 40 characters per line, File A has about 400,000 lines in it and File B has about 1,200,000 lines in it. You're basically running grep for each line in File A and having grep plow through 1,200,000 lines with each execution. That's 480 BILLION lines you're parsing through. Unix tools are surprisingly quick, but even something fast done 480 billion times will add up.

You would be better off using a full scripting language like Perl or Python. You put all lines in File B in a hash. You take each line in File A and check to see if that fourth field matches something in the hash.

Reading in a few hundred thousand lines? Creating a 10,000,000 entry hash? Perl can parse both of those in a matter of minutes.

Something -- off the top of my head. You didn't give us much in the way of specs, so I didn't do any testing:

#! /usr/bin/env perl

use strict;
use warnings;
use autodie;
use feature qw(say);

# Create your index
open my $file_b, "<", "file_b.txt";
my %index;

while (my $line = <$file_b>) {
    chomp $line;
    $index{$line} = $line;    #Or however you do it...
}
close $file_b;


#
# Now check against file_a.txt
#

open my $file_a, "<", "file_a.txt";
while (my $line = <$file_a>) {
    chomp $line;
    my @fields = split /\s+/, $line;
    if (exists $index{$fields[3]}) {    # KEYFIELD is the fourth column
         say "Line: $line";
    }
}
close $file_a;

The hash means you only have to read through file_b once instead of 400,000 times. Start the program, go grab a cup of coffee from the office kitchen. (Yum! non-dairy creamer!) By the time you get back to your desk, it'll be done.
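The same idea translates directly to Python, which was mentioned above as an alternative. Here is a minimal sketch under the same assumptions as the Perl version: the keys live one per line in file_b.txt, and the 4th whitespace-separated field of file_a.txt is the one to look up.

#!/usr/bin/env python3

# Build a lookup set from file_b.txt (one key per line), then print every
# line of file_a.txt whose 4th whitespace-separated field is in the set.
# File names and the field position are assumptions carried over from the
# Perl sketch above.

keys = set()
with open("file_b.txt") as file_b:
    for line in file_b:
        keys.add(line.strip())

with open("file_a.txt") as file_a:
    for line in file_a:
        fields = line.split()
        if len(fields) >= 4 and fields[3] in keys:
            print(line, end="")

A Python set plays the role of the Perl hash here: membership tests are constant time on average, so file_b.txt is still read only once.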

Here's one way using GNU awk. Run like:

awk -f script.awk fileB.txt fileA.txt

Contents of script.awk:

FNR==NR {
    # first file (fileB.txt): remember each line as a key
    array[$0]++
    next
}

{
    # second file (fileA.txt): strip the version suffix from field 4
    # (e.g. ENST00000449339.1 -> ENST00000449339) and test for membership
    line = $4
    sub(/\.[0-9]+$/, "", line)
    if (line in array) {
        print
    }
}

Alternatively, here's the one-liner:

awk 'FNR==NR { array[$0]++; next } { line = $4; sub(/\.[0-9]+$/, "", line); if (line in array) print }' fileB.txt fileA.txt

GNU awk can also perform the pre-processing of fileB.txt that you described using cut and sed. If you would like me to build this into the above script, you will need to provide an example of what this line looks like.


UPDATE using files HumanGenCodeV12 and GenBasicV12:

Run like:

awk -f script.awk HumanGenCodeV12 GenBasicV12 > output.txt

Contents of script.awk:

FNR==NR {
    # first file: keep only the alphanumeric part of field 12 and use it as a key
    gsub(/[^[:alnum:]]/,"",$12)
    array[$12]++
    next
}

{
    line = $4
    sub(/\.[0-9]+$/, "", line)
    if (line in array) {
        print
    }
}

This successfully prints lines in GenBasicV12 that can be found in HumanGenCodeV12. The output file (output.txt) contains 65340 lines. The script takes less than 10 seconds to complete.

grep -f seems to be very slow even for medium-sized pattern files (< 1MB). I guess it tries every pattern for each line in the input stream.

A solution, which was faster for me, was to use a while loop. This assumes that fileA is reasonably small (it is the smaller one in your example), so iterating multiple times over the smaller file is preferable to iterating over the larger file multiple times.

while read line; do
  grep -F "$line" fileA
done < fileBcutdown > outputfile

Note that this loop will output a line several times if it matches multiple patterns. To work around this limitation use sort -u, but this might be slower by quite a bit. You have to try.

while read line; do
  grep -F "$line" fileA
done < fileBcutdown | sort -u > outputfile

If you depend on the order of the lines, then I don't think you have any other option than using grep -f . But basically it boils down to trying m*n pattern matches.

Use the following command:

awk 'FNR==NR{a[$0];next}($4 in a)' <your filtered fileB with single field> fileA

This stores every line of the filtered fileB as a key in the array a, then prints each line of fileA whose 4th field appears in that array.
