Remove repeated lines in a file based on pattern

I've tried to find a good way to do this, but unfortunately I didn't find one.

I'm working with files with this format:

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true

As you can see, every SPEC line is different except in the last cluster, where the spectrum value is repeated. What I'd like to do is take every chunk of lines between the =Cluster= markers and check whether any lines repeat a spectrum value. If several lines are repeated, all but one of them should be removed.

The output file should be like this:

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true

I'm using this to split the file on the pattern, but I don't know how to check whether any spectrum values are repeated.

#!/usr/bin/perl

undef $/;
$_ = <>;
$n = 0;

for $match (split(/(?==Cluster=)/)) {
      open(O, '>temp' . ++$n);
      print O $match;
      close(O);
}

PS: I used Perl because it's easier for me, but I understand Python as well.

Something like this will remove duplicate lines (globally across the file).

#!/usr/bin/perl

use warnings;
use strict;

my %seen; 

while ( <> ) {
  next if ( m/SPEC/ and $seen{$_}++ );
  print;
}

If you want to be more specific about the spectrum value, for example:

next if ( m/spectrum=(\d+)/ and $seen{$1}++ );
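The choice of key matters here. With the whole line as the key, two lines count as duplicates only if they are identical; with the captured number as the key, they count as duplicates whenever the spectrum value matches, even if the file names differ. A small Python sketch of the two keys (the helper name spectrum_key and the two sample lines are made up for illustration):

```python
import re

def spectrum_key(line):
    """Return the digits after 'spectrum=', or None if the line has none."""
    m = re.search(r"spectrum=(\d+)", line)
    return m.group(1) if m else None

# Two hypothetical lines: same spectrum number, different file names
a = "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=473 true"
b = "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true"

print(a == b)                              # whole-line key: not duplicates
print(spectrum_key(a) == spectrum_key(b))  # spectrum key: duplicates
```

If the file name should count as part of the identity, key on the whole line instead.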

As you're splitting out your clusters, you can do something quite similar, but just:

  if ( $line =~ m/=Cluster=/ ) {
     open ( $output, ">", "temp".$count++ ); 
     select $output;
  }

This sets the default 'print' destination to $output (you'll need to declare it outside your loop too).

You should also:

  • use strict; use warnings;
  • Avoid reading <> into $_; it's unnecessary. If you do have to slurp the whole file, it's generally better to use $block = do { local $/; <> }; instead, and then match with $block =~ m/regex/
  • Use lexical file handles: open ( my $output, '>', 'filename' ) or die $!;
  • Check your return code on open (or die $! is usually sufficient).

So that would be something like:

#!/usr/bin/perl

use warnings;
use strict;

my %seen; 
my $count = 0; 
my $output; 

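while ( <> ) {
  next if ( m/spectrum=(\d+)/ and $seen{$1}++ );
  if ( m/=Cluster=/ ) {
     # New cluster: forget the spectrum values seen so far,
     # so duplicates are only removed within a cluster
     %seen = ();
     open ( $output, ">", "temp".$count++ ) or die $!;
     select $output;
  }
  print;
}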
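Since you mentioned Python is fine too, here is a sketch of the same idea that keeps everything in one output stream rather than one temp file per cluster. It is not a definitive implementation: the function name dedupe_clusters and the sample list are my own, and it assumes duplicates are identified by the number after spectrum= and that the set of seen values should start fresh at every =Cluster= line.

```python
import re

SPECTRUM = re.compile(r"spectrum=(\d+)")

def dedupe_clusters(lines):
    """Yield lines, dropping repeated spectrum values within each cluster."""
    seen = set()
    for line in lines:
        if line.startswith("=Cluster="):
            seen = set()      # new cluster: forget earlier spectrum values
        m = SPECTRUM.search(line)
        if m:
            if m.group(1) in seen:
                continue      # this spectrum was already kept in this cluster
            seen.add(m.group(1))
        yield line

# Hypothetical sample: a duplicate inside one cluster, and the same
# spectrum number again in the next cluster (which is kept)
sample = [
    "=Cluster=\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true\n",
    "\n",
    "=Cluster=\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=473 true\n",
]
result = list(dedupe_clusters(sample))
print("".join(result), end="")
```

For a real file you would pass the open file object in place of sample, since iterating over a file yields its lines one at a time.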

You can also use this Python script, in which I used groupby from the itertools module.

I assume your input file is called f_input.txt and the output file is called new_file.txt.

from itertools import groupby

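# Split each line on the marker:
#   "=Cluster=" -> ['', ''], blank line -> [''], any other line -> [line]
with open("f_input.txt", 'r') as src:
    data = [k.rstrip().split("=Cluster=") for k in src]

# groupby collapses each run of identical neighbouring lines into one entry
final = [k for k, _ in groupby(data)]

with open("new_file.txt", 'w') as f:
    for k in final:
        if k == ['', '']:
            f.write("=Cluster=\n")
        elif k == ['']:
            f.write("\n")
        else:
            # Re-join on the marker so the original line is restored intact
            f.write("=Cluster=".join(k) + "\n")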

The output file new_file.txt will be similar to your desired output.
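One thing to keep in mind is that groupby only merges runs of equal items that sit next to each other, so a duplicate separated by another line survives. A quick sketch with made-up values:

```python
from itertools import groupby

lines = ["a", "a", "b", "a"]

# Keep one key per run of equal neighbours
collapsed = [key for key, _ in groupby(lines)]
print(collapsed)  # ['a', 'b', 'a']
```

That is fine for the sample in the question, where the repeated lines are adjacent.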

If the duplicate lines are consecutive, you could use this Perl one-liner:

perl -ani.back -e 'next if defined($p) && $_ eq $p;$p=$_;print' file.txt 

The original file is backed up with the extension .back

This task seems simple enough that Perl/Python isn't needed: use the uniq command to remove adjacent duplicate lines:

$ uniq < input.txt > output.txt
