I have a tab-delimited file with 6 columns (only 2 are shown here for simplicity):
46_#1 A
47_#1 B
49_#1 C
51_#1 D
51_#1 E
I want to count the duplicates in the first column (count only, no removal) and store the count in a new column, so the output should be:
46_#1 1 A
47_#1 1 B
49_#1 1 C
51_#1 2 D
51_#1 2 E
I have used the Linux command
uniq -c file
but this compares the whole line (not just the first column), so then I tried
uniq -c -w5 file
But the character count of the first column can vary, so a fixed -w width doesn't work.
Can anyone help, please?
PS: I have a very big file (around 1 GB).
I don't like just providing complete solutions, but it seemed the easiest way to explain. This program reads through the file twice: first to accumulate the frequency information and then to output the modified data.
use strict;
use warnings;

@ARGV or die "No input file specified";
open my $fh, '<', $ARGV[0] or die "Unable to open input file: $!";

# First pass: tally how many times each first-column key appears.
my %count;
while (<$fh>) {
    next unless my ($key) = split;
    $count{$key}++;
}

# Rewind and make a second pass, inserting the count after the key.
# (Note: output fields are joined with single spaces; print with "\t" instead to keep tabs.)
seek $fh, 0, 0;
while (<$fh>) {
    chomp;
    next unless my ($key, $rest) = split ' ', $_, 2;
    print "$key $count{$key} $rest\n";
}
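To run it, assuming the script is saved under a name of your choosing, e.g. add_counts.pl:

perl add_counts.pl file.txt > file_with_counts.txt

Memory use grows with the number of distinct keys rather than the file size, so a 1 GB input is fine as long as the distinct first-column values fit in memory.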
Assuming that the file is sorted, you can use simple commands to do it:
sorin@sorin: $ join -1 1 -2 2 -o1.1,2.1,1.2 sample.txt <(cut -f1 sample.txt | uniq -c)
46_#1 1 A
47_#1 1 B
49_#1 1 C
51_#1 2 D
51_#1 2 E
-1 1 -2 2: joins based on the first column of the first file and the second column of the second file
-o1.1,2.1,1.2: selects which columns to output
<(): process substitution; the output of the process is turned into an input file
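For reference, here is what the process-substituted part produces on the sample data; uniq -c prints the count before the key, which is why the join field for the second file is 2:

sorin@sorin: $ cut -f1 sample.txt | uniq -c
      1 46_#1
      1 47_#1
      1 49_#1
      2 51_#1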
Note: if the file isn't sorted, it's probably better to use the previous answer, since you mention in a comment that duplicate keys can be far apart.
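If sorting first is inconvenient, the same two-pass idea as the Perl answer can also be sketched in awk (a minimal sketch, assuming tab-delimited input; the file is simply named twice on the command line):

awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { count[$1]++; next }    # pass 1: tally each first-column key
     { $1 = $1 OFS count[$1]; print }   # pass 2: insert the count after column 1
' file file

NR == FNR holds only while the first copy of the file is being read; assigning to $1 makes awk rebuild the record with OFS, so the count lands between the first and second columns. Like the Perl version, this keeps one hash entry per distinct key in memory.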