
Splitting a changing string with Perl

I have a bunch of strings in Perl that all look like this:

10 NE HARRISBURG
4 E HASWELL
2 SE OAKLEY
6 SE REDBIRD
PROVO
6 W EADS
21 N HARRISON

What I need to do is remove the numbers and letters that come before the city names. The problem is that the prefix varies a lot from line to line; the data is almost never the same. Is it possible to strip this data off and keep it in a separate string?

Try this:

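for my $s (@strings) {
    # Split on whitespace into at most 3 fields: number, direction, city
    my @fields = split /\s+/, $s, 3;
    # The city is the last field, however many fields were found
    my $city = $fields[-1];
}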

You can test the array size to determine the number of fields:

my $n = @fields;
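As a minimal sketch of that check, the field count can decide which parts of the line are present. The helper name `parse_by_fields` and the meaning given to each field count are assumptions made for illustration, and the sketch only holds for single-word city names:

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into at most 3 fields and use the
# field count to decide which parts are present.
# Assumes the city name is a single word.
sub parse_by_fields {
    my ($s) = @_;
    my @fields = split /\s+/, $s, 3;
    my $n = @fields;    # an array in scalar context yields its element count

    return +{ city => $fields[0] }                       if $n == 1;
    return +{ number => $fields[0], city => $fields[1] } if $n == 2;
    return +{
        number    => $fields[0],
        direction => $fields[1],
        city      => $fields[2],
    };
}

for my $s ('10 NE HARRISBURG', 'PROVO') {
    my $r = parse_by_fields($s);
    print join(', ', map { "$_=$r->{$_}" } sort keys %$r), "\n";
}
```

A line such as `32 SAN FRANCISCO` would be misread by this sketch (SAN would land in the direction slot), which is the limitation the regex-based answers in this thread try to address.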
my @l = (
'10 NE HARRISBURG',
'4 E HASWELL',
'2 SE OAKLEY',
'6 SE REDBIRD',
'PROVO',
'6 W EADS',
'21 N HARRISON',
);

foreach (@l) {
    # Regex changed following hoobs's suggestion.
    # $beg gets the optional number and direction, $rest gets the city.
    my($beg, $rest) = ($_ =~ /^(\d*\s(?:[NS]|[NS]?[EW])*)?\s*(.*)$/);
    print "beg=$beg \trest=$rest\n";
}

Output:

beg=10 NE   rest=HARRISBURG
beg=4 E     rest=HASWELL
beg=2 SE    rest=OAKLEY
beg=6 SE    rest=REDBIRD
beg=    rest=PROVO
beg=6 W     rest=EADS
beg=21 N    rest=HARRISON

For shinjuo: if you want to process only one string, you can do:

  my($beg, $rest) = ($l[3] =~ /^(\d*\s(?:[NS]|[NS]?[EW])*)?\s*(.*)$/);
  print "beg=$beg \trest=$rest\n";

To avoid a warning about an uninitialized value, you have to test whether $beg is defined:

print defined $beg ? "beg=$beg\t" : "", "rest=$rest\n";
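One way to avoid repeating the defined test at every print is to normalise the value once. The following is a sketch, not part of the answer above: the wrapper name `split_prefix` is hypothetical, the pattern is the answer's regex with an extra `\s*` so the city has no leading space, and the defined-or operator `//` requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical wrapper: return (prefix, city), substituting an empty
# string when the line has no number/direction prefix.
sub split_prefix {
    my ($line) = @_;
    my ($beg, $rest) = $line =~ /^(\d*\s(?:[NS]|[NS]?[EW])*)?\s*(.*)$/;
    return ($beg // '', $rest);    # // is defined-or (Perl 5.10+)
}

for my $line ('6 SE REDBIRD', 'PROVO') {
    my ($beg, $rest) = split_prefix($line);
    print "beg=$beg\trest=$rest\n";
}
```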

Looks like you always want the very last element in the result of split(). Or you can go with m/(\S+)$/.
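A small sketch of the m/(\S+)$/ variant follows. The helper name `last_word` is made up for illustration, and the sketch assumes the line has no trailing whitespace (otherwise the pattern does not match):

```perl
use strict;
use warnings;

# Hypothetical helper: return the last run of non-whitespace characters
# on the line, or undef if the pattern does not match.
sub last_word {
    my ($line) = @_;
    return $line =~ /(\S+)$/ ? $1 : undef;
}

print last_word($_), "\n" for '10 NE HARRISBURG', 'PROVO';
```

Like taking the last element of split(), this keeps only the final word, so a multi-word name such as SAN FRANCISCO would be reduced to FRANCISCO.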

Can't we assume there is always a city name and that it appears last on the line? If so, split the line and keep the last field. Here's a one-liner for the command line:

perl -lne 'my @f = split; print $f[-1]' input.txt

Output:

HARRISBURG
HASWELL
OAKLEY
REDBIRD
PROVO
EADS
HARRISON

Update 1

This solution won't work if you have multi-word city names like SAN FRANCISCO (a case pointed out in a comment below).

Where is your input data coming from? If you generated it yourself, you should add delimiters. If someone generated it for you, ask them to regenerate it with delimiters. Parsing it then becomes child's play.

# replace ";" with your delimiter
perl -lne 'my @f = split /;/; print $f[-1]' input.txt
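To illustrate why delimiters make this easy, here is a sketch that assumes a hypothetical record layout of `number;direction;city`, with empty fields allowed. Neither that layout nor the name `parse_delimited` comes from the original data:

```perl
use strict;
use warnings;

# Assumed (hypothetical) record layout: number;direction;city
# Missing values are left as empty fields, e.g. ";;PROVO".
sub parse_delimited {
    my ($line) = @_;
    # The limit of 3 keeps any later ";" inside the city field.
    my ($number, $direction, $city) = split /;/, $line, 3;
    return ($number, $direction, $city);
}

my @rec = parse_delimited('32;;SAN FRANCISCO');
print "city=$rec[2]\n";
```

With explicit delimiters, a multi-word city name no longer needs any guessing, because the field boundaries are unambiguous.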

Regex Solution


Solution 1: Keep everything (vol7ron's emailed solution)


#!/usr/bin/perl -w    

use strict; 
use Data::Dumper;   

   sub main{    
      my @strings = (    
                      '10 NE HARRISBURG'    
                    , '4 E HASWELL'    
                    , '2 SE OAKLEY'    
                    , '6 SE REDBIRD'    
                    , 'PROVO'    
                    , '6 W EADS'    
                    , '21 N HARRISON'    
                    , '32 SAN FRANCISCO' 
                    , ''   
                    , '15 NEW YORK'    
                    , '15 NNW NEW YORK'    
                    , '15 NW NEW YORK'     
                    , 'NW NEW YORK'    
                    );       

      my %hash;
      my $count=0;
      for (@strings){    
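         # Anchor at the start so a letter inside the city name (e.g. the
         # "W " in "NEW YORK") cannot be mistaken for a direction
         if (/^\d*\s*[NS]{0,2}[EW]{0,1}\s+/){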
            # if there was a speed / direction
            $hash{$count}{wind} = $&;
            $hash{$count}{city} = $';
         } else {
            # if there was only a city
            $hash{$count}{city} = $_;
         }
         $count++;
      }    

      print Dumper(\%hash);  
   }    

   main();  


Solution 2: Strip off what you don't need


#!/usr/bin/perl -w    

use strict;    

   sub main{    
      my @strings = (    
                      '10 NE HARRISBURG'    
                    , '4 E HASWELL'    
                    , '2 SE OAKLEY'    
                    , '6 SE REDBIRD'    
                    , 'PROVO'    
                    , '6 W EADS'    
                    , '21 N HARRISON'    
                    , '32 SAN FRANCISCO'    
                    , '15 NEW YORK'    
                    , '15 NNW NEW YORK'    
                    , '15 NW NEW YORK'     
                    , 'NW NEW YORK'     
                    );    

      for my $elem (@strings){    
         # Anchored at the start so only a leading prefix is removed
         $elem =~ s/^\d*\s*[NS]{0,2}[EW]{0,1}\s+//;
      }    

      $"="\n";    
      print "@strings\n";        
   }    

   main();    

Update:

After making the changes from vol7ron's suggestion and example, using the repetition quantifier worked. This strips off the leading digits and the direction, and won't break if the digits or the direction (or both) are missing.
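The edge cases raised in this thread (multi-word cities, a missing number, a missing direction) can be checked with a small table of inputs and expected results. This is a sketch under stated assumptions: the helper name `strip_prefix` is hypothetical, and the pattern is anchored with `^` so that only a leading prefix is removed.

```perl
use strict;
use warnings;

# Hypothetical helper: strip an optional leading number and compass
# direction, leaving the city name. The pattern is anchored with ^.
sub strip_prefix {
    my ($s) = @_;
    $s =~ s/^\d*\s*[NS]{0,2}[EW]?\s+//;
    return $s;
}

# Inputs paired with the city name we expect to be left over
my %expected = (
    '10 NE HARRISBURG' => 'HARRISBURG',
    'PROVO'            => 'PROVO',
    '32 SAN FRANCISCO' => 'SAN FRANCISCO',
    '15 NNW NEW YORK'  => 'NEW YORK',
    'NW NEW YORK'      => 'NEW YORK',
    'NEW YORK'         => 'NEW YORK',
);

for my $in (sort keys %expected) {
    my $got = strip_prefix($in);
    printf "%-18s => %-14s %s\n", $in, $got,
        $got eq $expected{$in} ? 'ok' : 'MISMATCH';
}
```

One ambiguity remains whatever the pattern: a city whose own name starts with a standalone direction letter (for example a hypothetical "E PALO ALTO") cannot be told apart from a direction prefix without extra information.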


 