Command line to match lines with matching first field (sed, awk, etc.)

What is a fast and succinct way to match lines from a text file that share the same first field?

Sample input:

a|lorem
b|ipsum
b|dolor
c|sit
d|amet
d|consectetur
e|adipisicing
e|elit

Desired output:

b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit

Desired output, alternative:

b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit

I can imagine many ways to write this, but I suspect there's a smart way to do it, e.g., with sed, awk, etc. My source file is approx. 0.5 GB.

There are some related questions here, e.g., "awk | merge line on the basis of field matching", but that other question loads too much content into memory. I need a streaming method.

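For fixed-width fields you can use GNU uniq; -w 1 compares only the first character of each line and -D prints every line of each duplicated group: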

$ uniq -Dw 1 file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit

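If you don't have fixed-width fields, here are two awk solutions. Both hold every line in memory until END, and awk does not specify the order of for (k in a), so the output order may vary: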

awk -F'|' '{a[$1]++;b[$1]=(b[$1])?b[$1]RS$0:$0}END{for(k in a)if(a[k]>1)print b[k]}' file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit

awk -F'|' '{a[$1]++;b[$1]=b[$1]FS$2}END{for(k in a)if(a[k]>1)print k b[k]}' file
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit

Here's a method where you only have to remember the previous line (it therefore requires the input file to be sorted, or at least grouped by key):

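awk -F \| '
    # same key as the previous line: the previous line is part of a duplicate group
    $1 == prev_key {print prev_line; matches++}
    $1 != prev_key {
        # key changed: flush the last line of the group that just ended
        if (matches) print prev_line
        matches = 0
        prev_key = $1
    }
    {prev_line = $0}
    # use prev_line rather than $0, which some older awks do not preserve in END
    END { if (matches) print prev_line }
' filename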
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit

Alternative output:

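awk -F \| '
    $1 == prev_key {
        # first repeat of this key: start the output line with the key itself
        if (matches == 0) printf "%s", $1
        printf "%s%s", FS, prev_value
        matches++
    }
    $1 != prev_key {
        # key changed: finish the merged line of the group that just ended
        if (matches) printf "%s%s\n", FS, prev_value
        matches = 0
        prev_key = $1
    }
    {prev_value = $2}
    # use prev_value rather than $2, which some older awks do not preserve in END
    END {if (matches) printf "%s%s\n", FS, prev_value}
' filename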
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
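
Both scripts above rely on lines with the same key being adjacent. If the file is not already grouped, one option is to sort it first; sort works through temporary files for large inputs, so the pipeline as a whole still avoids holding 0.5 GB in memory. The following is only a sketch under that assumption: dup_lines is a hypothetical helper name, and -s (stable sort, to keep the original order within a key) is a GNU/BSD extension rather than POSIX.

```shell
# Hypothetical helper: print every line whose first |-separated field
# occurs on more than one line of the file given as $1.
dup_lines() {
    # Sort on field 1 only; -s keeps the input order among equal keys.
    LC_ALL=C sort -t '|' -k1,1 -s "$1" |
    awk -F '|' '
        # Same key as the held line: emit the held line once, then this one.
        NR > 1 && $1 == prev { if (!printed) print held; print; printed = 1; next }
        # New key: hold the line until we know whether the key repeats.
        { held = $0; prev = $1; printed = 0 }
    '
}
```

Called as dup_lines filename, this should give the first desired output. Because each duplicate is printed as soon as it is seen, no END rule is needed.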

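Using awk (this keeps one entry per key in memory, and the output order of for (i in b) is unspecified; the first value is added only once, so keys with three or more lines merge correctly):

awk -F '|' '!($1 in a){a[$1]=$2; next} {b[$1]=(($1 in b) ? b[$1] : FS a[$1]) FS $2}
    END{for(i in b) print i b[i]}' file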
d|amet|consectetur
e|adipisicing|elit
b|ipsum|dolor

This might work for you (GNU sed):

sed -r ':a;$!N;s/^(([^|]*\|).*)\n\2/\1|/;ta;/^([^\n|]*\|){2,}/P;D' file

This reads two lines into the pattern space and checks whether both start with the same key. If they do, it replaces the newline and the second key with a | and repeats with the next line. If not, it checks whether the first line holds more than two fields: if so, it prints that line and then deletes it; otherwise it just deletes it. The result is the alternative (merged) output, and lines with the same key must be adjacent in the input.
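
To watch that loop on a few lines of the question's sample, the command can be wrapped in a small function that reads standard input. This is just an illustrative sketch: merge_dups_sed is a hypothetical name, and it assumes GNU sed (for -r and for \n inside a bracket expression).

```shell
# Hypothetical wrapper around the sed command above; reads stdin.
# Assumes GNU sed and input already grouped by key.
merge_dups_sed() {
    sed -r ':a;$!N;s/^(([^|]*\|).*)\n\2/\1|/;ta;/^([^\n|]*\|){2,}/P;D'
}

# a and c occur once and are dropped; the two b lines are merged.
printf '%s\n' 'a|lorem' 'b|ipsum' 'b|dolor' 'c|sit' | merge_dups_sed
# prints: b|ipsum|dolor
```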

$ awk -F'|' '$1 == prev {rec = rec RS $0; size++; next} {if (size>1) print rec; rec=$0; size=1} {prev = $1} END{if (size>1) print rec}' file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit

$ awk -F'|' '$1 == prev {rec = rec FS $2; size++; next} {if (size>1) print rec; rec=$0; size=1} {prev = $1} END{if (size>1) print rec}' file
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
