
How to output only the first n lines based on occurrences of a column string

I have a large file that contains ID numbers in the first column, followed by additional unique information in subsequent columns. Each ID number occurs multiple times in the file:

000112 3489 A-8 40984
000112 4743 A-7 94587
000112 5894 A-1 45795
000177 8347 A-2 54575
000177 5843 B-5 94342
000177 5684 A-4 76544
000177 6586 C-2 65834
000226 5679 C-2 85795
000226 5456 C-1 45876
000226 9899 A-2 56834

I would like to output a file containing lines for only the first two occurrences of each ID number:

000112 3489 A-8 40984
000112 4743 A-7 94587
000177 8347 A-2 54575
000177 5843 B-5 94342
000226 5679 C-2 85795
000226 5456 C-1 45876

Note that this data represents only a small portion of the input file, so a command that requires entering specific strings (ID numbers) is not what I'm looking for. Thanks!

awk 'a[$1]++ < 2' input-file

should do the trick. The command reads the file and increments an array entry keyed by the value in the first column. While that counter is still below 2, the line is printed. When the same ID appears for the third time, its counter is already 2, so the condition is false and the line is suppressed.
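The occurrence limit can also be passed in with awk's `-v` option instead of being hard-coded. A minimal runnable sketch, using the question's sample data (the file name `input-file` is taken from the answer; the `n` parameter is an addition for illustration):

```shell
# Recreate the question's sample data.
cat > input-file <<'EOF'
000112 3489 A-8 40984
000112 4743 A-7 94587
000112 5894 A-1 45795
000177 8347 A-2 54575
000177 5843 B-5 94342
000177 5684 A-4 76544
000177 6586 C-2 65834
000226 5679 C-2 85795
000226 5456 C-1 45876
000226 9899 A-2 56834
EOF

# Keep at most n lines per distinct value of column 1.
# a[$1]++ returns the count *before* incrementing, so a line is
# printed only while fewer than n lines with that ID have been seen.
awk -v n=2 'a[$1]++ < n' input-file
```

This prints the first two lines for each ID, matching the desired output in the question.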

This isn't pretty but it yields the desired output:

Step 1:

awk '!seen[$1]++' input.file > output1 

Step 2:

grep -v -x -F -f output1 input.file | awk '!seen[$1]++' > output2

(The `-x` flag restricts the fixed-string patterns to whole-line matches, so a line from output1 that happens to be a substring of another line cannot remove that line by accident.)

Step 3:

cat output1 output2 | sort -s -k1,1 > desired.output

(A stable sort on the first field alone keeps each ID's first occurrence ahead of its second. Plain `sort -k 1` uses everything from field 1 to the end of the line as the key, which can place the second occurrence first.)
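The three steps can be run end to end on the question's sample data. A self-contained sketch (file names `input.file`, `output1`, `output2`, and `desired.output` are taken from the answer):

```shell
# Recreate the question's sample data.
cat > input.file <<'EOF'
000112 3489 A-8 40984
000112 4743 A-7 94587
000112 5894 A-1 45795
000177 8347 A-2 54575
000177 5843 B-5 94342
000177 5684 A-4 76544
000177 6586 C-2 65834
000226 5679 C-2 85795
000226 5456 C-1 45876
000226 9899 A-2 56834
EOF

# Step 1: keep the first occurrence of each ID.
awk '!seen[$1]++' input.file > output1

# Step 2: drop those exact lines (-x: whole-line fixed-string
# matches), then keep the first occurrence of each ID again,
# which yields the original second occurrences.
grep -v -x -F -f output1 input.file | awk '!seen[$1]++' > output2

# Step 3: interleave the two files. A stable sort (-s) on the
# ID field alone (-k1,1) keeps each first occurrence ahead of
# its second occurrence.
cat output1 output2 | sort -s -k1,1 > desired.output
cat desired.output
```

The single-command `awk 'a[$1]++ < 2'` answer above produces the same result in one pass, but this version shows each intermediate file explicitly.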
