I have a tab-delimited file with 6 columns (only 2 are shown here for simplicity):
46_#1 A
47_#1 B
49_#1 C
51_#1 D
51_#1 E
I want to count the duplicates in the first column (count only, no removal) and store the count in a new column, so the output should be:
46_#1 1 A
47_#1 1 B
49_#1 1 C
51_#1 2 D
51_#1 2 E
I have used the Linux command
uniq -c file
but this compares the whole line (not just the first column), so then I tried
uniq -c -w5 file
But the character count of the first column can vary, so a fixed -w width doesn't work.
Can anyone help, please?
PS: I have a very big file (around 1 GB).
I don't like just providing complete solutions, but it seemed the easiest way to explain. This program reads through the file twice: first to accumulate the frequency information and then to output the modified data.
use strict;
use warnings;

@ARGV or die "No input file specified";
open my $fh, '<', $ARGV[0] or die "Unable to open input file: $!";

# First pass: tally how many times each first-column key appears.
my %count;
while (<$fh>) {
    next unless my ($key) = split;
    $count{$key}++;
}

# Rewind and make a second pass, inserting the count after the key.
# (Note: output fields are joined with single spaces; print with "\t" instead to keep tabs.)
seek $fh, 0, 0;
while (<$fh>) {
    chomp;
    next unless my ($key, $rest) = split ' ', $_, 2;
    print "$key $count{$key} $rest\n";
}
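To run it, assuming the script is saved under a name of your choosing, e.g. add_counts.pl:

perl add_counts.pl file.txt > file_with_counts.txt

Memory use grows with the number of distinct keys rather than the file size, so a 1 GB input is fine as long as the distinct first-column values fit in memory.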
Assuming that the file is sorted, you can use simple commands to do it:
sorin@sorin: $ join -1 1 -2 2 -o1.1,2.1,1.2 sample.txt <(cut -f1 sample.txt | uniq -c)
46_#1 1 A
47_#1 1 B
49_#1 1 C
51_#1 2 D
51_#1 2 E
-1 1 -2 2: joins based on the first column of the first file and the second column of the second file
-o1.1,2.1,1.2: selects which columns to output
<(): process substitution; the output of the process is turned into an input file
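For reference, here is what the process-substituted part produces on the sample data; uniq -c prints the count before the key, which is why the join field for the second file is 2:

sorin@sorin: $ cut -f1 sample.txt | uniq -c
      1 46_#1
      1 47_#1
      1 49_#1
      2 51_#1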
Note: if the file isn't sorted, it's probably better to use the previous answer, since you mention in a comment that duplicate keys can be far apart.
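If sorting first is inconvenient, the same two-pass idea as the Perl answer can also be sketched in awk (a minimal sketch, assuming tab-delimited input; the file is simply named twice on the command line):

awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { count[$1]++; next }    # pass 1: tally each first-column key
     { $1 = $1 OFS count[$1]; print }   # pass 2: insert the count after column 1
' file file

NR == FNR holds only while the first copy of the file is being read; assigning to $1 makes awk rebuild the record with OFS, so the count lands between the first and second columns. Like the Perl version, this keeps one hash entry per distinct key in memory.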