简体   繁体   中英

Splitting a changing string with perl

I have a bunch of strings in perl that all look like this:

10 NE HARRISBURG
4 E HASWELL
2 SE OAKLEY
6 SE REDBIRD
PROVO
6 W EADS
21 N HARRISON

What I am needing to do is remove the numbers and the letters from before the city names. The problem I am having is that it varies a lot from city to city. The data is almost never the same. Is it possible to remove this data and keep it in a separate string?

Try this:

for my $s (@strings) {
    my @fields = split /\s+/, $s, 3;
    my $city = $fields[-1];
}

You can test the array size to determine the number of fields:

my $n = @fields;
my @l = (
'10 NE HARRISBURG',
'4 E HASWELL',
'2 SE OAKLEY',
'6 SE REDBIRD',
'PROVO',
'6 W EADS',
'21 N HARRISON',
);

foreach(@l) {

according to hoobs i changed the regex

    my($beg, $rest) = ($_ =~ /^(\d*\s(?:[NS]|[NS]?[EW])*)?(.*)$/);
    print "beg=$beg \trest=$rest\n";    
}

output:

beg=10 NE   rest=HARRISBURG
beg=4 E     rest=HASWELL
beg=2 SE    rest=OAKLEY
beg=6 SE    rest=REDBIRD
beg=    rest=PROVO
beg=6 W     rest=EADS
beg=21 N    rest=HARRISON

for shinjuo, if you want to run only one string you can do :

  my($beg, $rest) = ($l[3] =~ /^(\d*\s(?:[NS]|[NS]?[EW])*)?(.*)$/);
  print "beg=$beg \trest=$rest\n";

and to avoid warning on uninitialized value you have to test if $beg is defined:

print defined$beg?"beg=$beg\t":"", "rest=$rest\n";

Looks like you always want the very last element in the result of split(). Or you can go with m/(\\S+)$/.

Can't we assume there is always a city name and that it appears last on a line? If that's the case, split the line and keep the last portion of it. Here's a one liner command line solution:

perl -lne 'split ; print $_[-1]' input.txt

Output:

HARRISBURG
HASWELL
OAKLEY
REDBIRD
PROVO
EADS
HARRISON

Update 1

This solution won't work if you have composed city names like SAN FRANCISCO (case spotted in a comment below).

Where is your input data coming from? If you have generated it yourself, you should add delimiters. If someone generated it for you, ask them to regenerate it with delimiters. Parsing it will then become child's play.

# replace ";" for your delimiter
perl -lne 'split ";" ; print $_[-1]' input.txt

Regex Solution


Solution 1: Keep everything (vol7ron's emailed solution)


#!/usr/bin/perl -w    

use strict; 
use Data::Dumper;   

   sub main{    
      my @strings = (    
                      '10 NE HARRISBURG'    
                    , '4 E HASWELL'    
                    , '2 SE OAKLEY'    
                    , '6 SE REDBIRD'    
                    , 'PROVO'    
                    , '6 W EADS'    
                    , '21 N HARRISON'    
                    , '32 SAN FRANCISCO' 
                    , ''   
                    , '15 NEW YORK'    
                    , '15 NNW NEW YORK'    
                    , '15 NW NEW YORK'     
                    , 'NW NEW YORK'    
                    );       

      my %hash;
      my $count=0;
      for (@strings){    
         if (/\d*\s*[NS]{0,2}[EW]{0,1}\s+/){
            # if there was a speed / direction
            $hash{$count}{wind} = $&;
            $hash{$count}{city} = $';
         } else {
            # if there was only a city
            $hash{$count}{city} = $_;
         }
         $count++;
      }    

      print Dumper(\%hash);  
   }    

   main();  


Solution 2: Strip off what you don't need


#!/usr/bin/perl -w    

use strict;    

   sub main{    
      my @strings = (    
                      '10 NE HARRISBURG'    
                    , '4 E HASWELL'    
                    , '2 SE OAKLEY'    
                    , '6 SE REDBIRD'    
                    , 'PROVO'    
                    , '6 W EADS'    
                    , '21 N HARRISON'    
                    , '32 SAN FRANCISCO'    
                    , '15 NEW YORK'    
                    , '15 NNW NEW YORK'    
                    , '15 NW NEW YORK'     
                    , 'NW NEW YORK'     
                    );    

      for my $elem (@strings){    
         $elem =~ s/\d*\s*[NS]{0,2}[EW]{0,1}\s+(\w*)/$1/;    
      }    

      $"="\n";    
      print "@strings\n";        
   }    

   main();    

Update:

Making the changes with vol7ron 's suggestion and example, using the repetition operator worked. This will strip off leading digits and the direction and won't break if the digits or direction (or both) are missing.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM