Command line to match lines with matching first field (sed, awk, etc.)

Question

What is a fast and succinct way to match lines from a text file that share the same first field?

Sample input:

a|lorem
b|ipsum
b|dolor
c|sit
d|amet
d|consectetur
e|adipisicing
e|elit

Desired output:

b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit

Desired output, alternative:

b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit

I can imagine many ways to write this, but I suspect there's a smart way to do it, e.g., with sed, awk, etc. My source file is approximately 0.5 GB.

There are some related questions here, e.g., "awk | merge line on the basis of field matching", but the answers to that question load too much content into memory. I need a streaming method.

Answer 1

For fixed-width fields you can use uniq (the -D and -w options are GNU extensions):

$ uniq -Dw 1 file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit

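If you don't have fixed-width fields, here are two awk solutions. Both hold the whole file in memory, and the order in which for (k in a) visits keys is unspecified, so the output order may differ from what is shown: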

awk -F'|' '{a[$1]++;b[$1]=(b[$1])?b[$1]RS$0:$0}END{for(k in a)if(a[k]>1)print b[k]}' file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit

awk -F'|' '{a[$1]++;b[$1]=b[$1]FS$2}END{for(k in a)if(a[k]>1)print k b[k]}' file
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
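
If the file is not grouped by key and holding every line in memory is too much, one possible alternative is to read the file twice: the first pass counts how often each key occurs, and the second pass prints the lines whose key was seen more than once. This is only a sketch and not part of the answer above; the function name dup_keys is made up for illustration. It keeps one counter per distinct key in memory rather than the lines themselves, and it preserves the input order.

```shell
# Hypothetical helper: print every line whose first |-separated field
# occurs on more than one line. The file is read twice, so it must be a
# regular file (not a pipe).
dup_keys() {
    awk -F'|' '
        NR == FNR { count[$1]++; next }   # pass 1: count each key
        count[$1] > 1                     # pass 2: print lines with a repeated key
    ' "$1" "$1"
}
```

Usage would be: dup_keys file. Whether this fits depends on how many distinct keys the file has, since the counters still grow with the number of keys.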

Answer 2

Here's a method where you only have to remember the previous line (it therefore requires lines with the same key to be adjacent, e.g., a sorted input file):

awk -F \| '
    $1 == prev_key {print prev_line; matches ++}
    $1 != prev_key {                            
        if (matches) print prev_line
        matches = 0
        prev_key = $1
    }                
    {prev_line = $0}
    END { if (matches) print prev_line }
' filename

b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit

Alternate output

awk -F \| '
    $1 == prev_key {
        if (matches == 0) printf "%s", $1 
        printf "%s%s", FS, prev_value
        matches ++
    }             
    $1 != prev_key {
        if (matches) printf "%s%s\n", FS, prev_value
        matches = 0                                 
        prev_key = $1
    }                
    {prev_value = $2}
    END {if (matches) printf "%s%s\n", FS, prev_value}
' filename

b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
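
The previous-line approach needs lines with equal keys to sit next to each other. If the input is not already grouped, one option is to sort on the first field and pipe the result into a streaming awk filter. The following is a sketch under assumptions, not the answer's own code: the function name sorted_dups is hypothetical, and the -s (stable) flag is assumed to be available, as it is in GNU and BSD sort. LC_ALL=C is set so that keys are compared byte by byte.

```shell
# Hypothetical wrapper: group lines by the first field, then keep only
# the groups that contain more than one line.
# Assumes a sort with -s (stable), so lines within a key keep their order.
sorted_dups() {
    LC_ALL=C sort -t'|' -k1,1 -s "$1" |
    awk -F'|' '
        NR > 1 && $1 == prev_key {           # same key as the previous line
            if (!printed) print prev_line    # first repeat: emit the held line
            print
            printed = 1
            next
        }
        { prev_key = $1; prev_line = $0; printed = 0 }   # new key: hold this line
    '
}
```

Usage would be: sorted_dups file. The awk part holds only one line at a time; the sorting cost and any temporary disk space are left to sort.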

Answer 3

Using awk (this also keeps the data in memory, and the output order is unspecified):

awk -F '|' '!($1 in a){a[$1]=$2; next} {b[$1]=(($1 in b) ? b[$1] : FS a[$1]) FS $2}
    END{for(i in b) print i b[i]}' file
d|amet|consectetur
e|adipisicing|elit
b|ipsum|dolor

Answer 4

This might work for you (GNU sed):

sed -r ':a;$!N;s/^(([^|]*\|).*)\n\2/\1|/;ta;/^([^\n|]*\|){2,}/P;D' file

This reads two lines into the pattern space and checks whether their keys are the same. If so, it removes the second key (joining the two lines with |) and repeats. If not, it checks whether the first line now has more than two fields and, if so, prints it. Either way, it then deletes the first line and continues with what is left. The result is the alternative (merged) output.
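
To make the steps easier to follow, the same script can be written with one command per line and a comment before each step. This is a sketch that assumes GNU sed (for -E, \n inside a bracket expression, and the back-reference in the pattern); the function name merge_dups is made up for illustration.

```shell
# Hypothetical wrapper around the one-liner, one sed command per line.
merge_dups() {
    sed -E '
        # Append the next input line to the pattern space (except on the last line).
        :a
        $!N
        # Same key on both lines: replace "newline + key|" with "|", then loop.
        s/^(([^|]*\|).*)\n\2/\1|/
        ta
        # Print the first line only if it now holds at least two separators.
        /^([^\n|]*\|){2,}/P
        # Drop the first line; restart without reading input if anything is left.
        D
    ' "$1"
}
```

Usage would be: merge_dups file. Like the one-liner, it expects lines with the same key to be adjacent.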

Answer 5

$ awk -F'|' '$1 == prev {rec = rec RS $0; size++; next} {if (size>1) print rec; rec=$0; size=1} {prev = $1} END{if (size>1) print rec}' file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit

$ awk -F'|' '$1 == prev {rec = rec FS $2; size++; next} {if (size>1) print rec; rec=$0; size=1} {prev = $1} END{if (size>1) print rec}' file
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit

5 answers

solution1
3 2013-08-28 16:31:33

solution2
3 ACCEPTED 2013-08-28 16:40:20

solution3
1 2013-08-28 16:35:15

solution4
1 2013-08-28 19:49:11

solution5
0 2013-08-28 17:02:44
