Scalable way of deleting all lines from a file where the line starts with one of many values

Question

Given an input file of variable values (example):

A
B
D

What is a script to remove all lines from another file which start with one of the above values? For example, the file contents:

A
B
C
D

Would end up being:

C

The input file contains on the order of 100,000 values. The file to be filtered contains several million lines.

Answer 1

awk '

    NR==FNR {     # IF this is the first file in the arg list THEN
        list[$0]  #     store the contents of the current record as an index of array "list"
        next      #     skip the rest of the script and so move on to the next input record
    }             # ENDIF

    {                                # This MUST be the second file in the arg list
        for (i in list)              # FOR each index "i" in array "list" DO
            if (index($0,i) == 1)    #     IF "i" starts at the 1st char on the current record THEN
                next                 #         move on to the next input record
    }

    1  # Specify a true condition and so invoke the default action of printing the current record.

' file1 file2
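The loop above compares every line of file2 against every value in file1, which for roughly 100,000 values and several million lines can mean hundreds of billions of string comparisons in the worst case. One possible variation, sketched here as an assumption and not part of the original answer, records which value lengths occur and does one array lookup per distinct length instead. The sample data and the `lens` array are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Hypothetical sample data; the names file1/file2 follow the answer above.
tmp=$(mktemp -d)
printf '%s\n' A B D > "$tmp/file1"
printf '%s\n' Apple Banana Cherry Date > "$tmp/file2"

result=$(awk '
    NR == FNR {                  # first file: the values (prefixes)
        if ($0 != "") {          # assumption: blank lines in file1 are ignored
            list[$0]             # store each value as an array index
            lens[length($0)]     # remember which value lengths occur
        }
        next
    }
    {                            # second file: the lines to filter
        for (n in lens)          # one hash lookup per distinct value length
            if (substr($0, 1, n) in list)
                next             # the line starts with a known value: drop it
        print
    }
' "$tmp/file1" "$tmp/file2")

echo "$result"
rm -rf "$tmp"
```

The work per line then depends on the number of distinct value lengths, not on the number of values, so it should help most when the values in file1 share only a few lengths.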

An alternative to building up an array and then doing a string comparison on each element would be to build up a regular expression, e.g.:

...
list = (list == "" ? "" : list "|") $0
...

and then doing an anchored RE comparison:

...
if ($0 ~ "^(" list ")")
    next
...

but I'm not sure that'd be any faster than the loop and you'd then have to worry about RE metacharacters appearing in file1.
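For completeness, here is a minimal sketch of what the regular-expression variant might look like as a whole script, with the metacharacter concern handled by escaping each value before it is added to the alternation. The variable names `re` and `p` and the sample data are hypothetical, and the `"\\\\&"` replacement relies on the POSIX awk rules for `gsub`. With 100,000 values the resulting expression would be very large, so treat this as an illustration, not a recommendation.

```shell
#!/bin/sh
# Hypothetical sample data: one value contains the metacharacter ".".
tmp=$(mktemp -d)
printf '%s\n' 'A' 'B.1' > "$tmp/file1"
printf '%s\n' 'Axx' 'B.1 yes' 'Bx1 no' 'C' > "$tmp/file2"

result=$(awk '
    NR == FNR {                                  # first file: the values
        if ($0 == "") next                       # assumption: skip blank values
        p = $0
        gsub(/[][\\.*^$(){}?+|]/, "\\\\&", p)    # put a backslash before RE metacharacters
        re = (re == "" ? "" : re "|") p          # append to the alternation
        next
    }
    re == "" || $0 !~ ("^(" re ")")              # print lines that do not start with any value
' "$tmp/file1" "$tmp/file2")

echo "$result"
rm -rf "$tmp"
```

Here `Bx1 no` survives because the dot in `B.1` is matched literally; without the escaping step it would have been removed as well.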

If all of your values in file1 are truly single characters, though, then this approach of creating a character list to use in an RE comparison might work well for you:

awk 'NR==FNR{list = list $0; next} $0 !~ "^[" list "]"' file1 file2

Answer 2

You can use comm to display the lines that are not common to both files, like this:

comm -3 file1 file2

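For the example files in the question, this prints C (indented by a tab, because it appears only in the second file).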

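Notice that for this to work, both files have to be sorted. If they aren't, you can work around that with process substitution. Also note that comm compares whole lines, so it only handles the case where a line is exactly equal to one of the values, as in the question's example: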

comm -3 <(sort file1) <(sort file2)
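As a small illustration of the sorting requirement, the sketch below sorts both inputs into temporary files (so it does not depend on process substitution, which plain sh lacks) and uses `comm -13`, a variation on the command above that hides the first column as well, so only lines unique to the second file are printed. The sample data and the `.sorted` file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample data based on the question; file2 is deliberately unsorted.
tmp=$(mktemp -d)
printf '%s\n' A B D > "$tmp/file1"
printf '%s\n' D C B A > "$tmp/file2"

# comm needs both inputs sorted in the same collation order.
LC_ALL=C sort "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort "$tmp/file2" > "$tmp/file2.sorted"

# -1 hides lines found only in file1, -3 hides lines common to both,
# leaving only the lines that occur in file2 alone.
result=$(LC_ALL=C comm -13 "$tmp/file1.sorted" "$tmp/file2.sorted")

echo "$result"
rm -rf "$tmp"
```

Note that the output comes back in sorted order, so the original line order of file2 is not preserved.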

Answer 3

You can also achieve this using egrep:

egrep -vf <(sed 's/^/^/' file1) file2

Let's see it in action:

$ cat file1
A
B
$ cat file2
Asomething
B1324
C23sd
D2356A
Atext
CtestA
EtestB
Bsomething
$ egrep -vf <(sed 's/^/^/' file1) file2
C23sd
D2356A
CtestA
EtestB

This would remove lines that start with one of the values in file1.
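Because each value is used as a regular expression, a value containing characters such as `.` or `*` would match more than intended. The following is a hedged sketch of the same idea using `grep -E` (recent GNU grep releases warn that `egrep` is obsolescent), with a `sed` step that escapes metacharacters before anchoring each value. The sample data and the `patterns` file are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample data: the value 1.5 contains the metacharacter ".".
tmp=$(mktemp -d)
printf '%s\n' 'A' '1.5' > "$tmp/file1"
printf '%s\n' 'Asomething' '1.5 volts' '125 units' 'C23sd' > "$tmp/file2"

# Drop blank values (an empty pattern would match every line),
# put a backslash before ERE metacharacters, then anchor each value with ^.
sed '/^$/d; s/[][\\.*^$(){}?+|]/\\&/g; s/^/^/' "$tmp/file1" > "$tmp/patterns"

# -v inverts the match, -f reads one pattern per line from the file.
result=$(grep -E -v -f "$tmp/patterns" "$tmp/file2")

echo "$result"
rm -rf "$tmp"
```

The line `125 units` is kept here; with an unescaped `^1.5` pattern it would have been removed along with `1.5 volts`.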

3 answers

solution1
3 2013-07-11 14:10:57

solution2
1 2013-07-11 14:23:54

solution3
1 2013-07-11 14:28:58
