
How to output only the first n lines based on occurrences of a column string

I have a large file that contains ID numbers in the first column, followed by additional unique information in subsequent columns. Each ID number occurs multiple times in the file:

000112 3489 A-8 40984
000112 4743 A-7 94587
000112 5894 A-1 45795
000177 8347 A-2 54575
000177 5843 B-5 94342
000177 5684 A-4 76544
000177 6586 C-2 65834
000226 5679 C-2 85795
000226 5456 C-1 45876
000226 9899 A-2 56834

I would like to output a file containing lines for only the first two occurrences of each ID number:

000112 3489 A-8 40984
000112 4743 A-7 94587
000177 8347 A-2 54575
000177 5843 B-5 94342
000226 5679 C-2 85795
000226 5456 C-1 45876

Note that this data represents only a small portion of the input file, so a command that requires entering specific strings (ID numbers) is not what I'm looking for. Thanks!

awk 'a[$1]++ < 2' input-file

should do the trick. The command reads the file and increments an array entry keyed by the value in the first column. While that counter is still below 2, the line is printed. When the same ID appears for the third time, its counter is already 2, so the condition is false and the line is suppressed.
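The occurrence limit can also be passed in with awk's `-v` option instead of being hard-coded. A minimal runnable sketch, using the question's sample data (the file name `input-file` is taken from the answer; the `n` parameter is an addition for illustration):

```shell
# Recreate the question's sample data.
cat > input-file <<'EOF'
000112 3489 A-8 40984
000112 4743 A-7 94587
000112 5894 A-1 45795
000177 8347 A-2 54575
000177 5843 B-5 94342
000177 5684 A-4 76544
000177 6586 C-2 65834
000226 5679 C-2 85795
000226 5456 C-1 45876
000226 9899 A-2 56834
EOF

# Keep at most n lines per distinct value of column 1.
# a[$1]++ returns the count *before* incrementing, so a line is
# printed only while fewer than n lines with that ID have been seen.
awk -v n=2 'a[$1]++ < n' input-file
```

This prints the first two lines for each ID, matching the desired output in the question.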

This isn't pretty but it yields the desired output:

Step 1:

awk '!seen[$1]++' input.file > output1 

Step 2:

grep -v -x -F -f output1 input.file | awk '!seen[$1]++' > output2

(The `-x` flag restricts the fixed-string patterns to whole-line matches, so a line from output1 that happens to be a substring of another line cannot remove that line by accident.)

Step 3:

cat output1 output2 | sort -s -k1,1 > desired.output

(A stable sort on the first field alone keeps each ID's first occurrence ahead of its second. Plain `sort -k 1` uses everything from field 1 to the end of the line as the key, which can place the second occurrence first.)
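The three steps can be run end to end on the question's sample data. A self-contained sketch (file names `input.file`, `output1`, `output2`, and `desired.output` are taken from the answer):

```shell
# Recreate the question's sample data.
cat > input.file <<'EOF'
000112 3489 A-8 40984
000112 4743 A-7 94587
000112 5894 A-1 45795
000177 8347 A-2 54575
000177 5843 B-5 94342
000177 5684 A-4 76544
000177 6586 C-2 65834
000226 5679 C-2 85795
000226 5456 C-1 45876
000226 9899 A-2 56834
EOF

# Step 1: keep the first occurrence of each ID.
awk '!seen[$1]++' input.file > output1

# Step 2: drop those exact lines (-x: whole-line fixed-string
# matches), then keep the first occurrence of each ID again,
# which yields the original second occurrences.
grep -v -x -F -f output1 input.file | awk '!seen[$1]++' > output2

# Step 3: interleave the two files. A stable sort (-s) on the
# ID field alone (-k1,1) keeps each first occurrence ahead of
# its second occurrence.
cat output1 output2 | sort -s -k1,1 > desired.output
cat desired.output
```

The single-command `awk 'a[$1]++ < 2'` answer above produces the same result in one pass, but this version shows each intermediate file explicitly.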
