Perl regex matching numbers

I'm working on a Perl assignment. One of the requirements is to match all integer and float numbers except those in comments or strings (double or single quoted).

Here are my assumptions:

  • Optional sign, integer, and fraction.
  • If the integer part is omitted, the fraction is mandatory.
  • If the fraction part is omitted, the decimal dot must be omitted.

And here is the regex I found.

([-+]?([0-9]+(\.[0-9]+)?|\.[0-9]+))

Here is my block of code. I had trouble excluding numbers in comments and strings, so I remove all comments and strings first. I also split the lines into words, which I believed would make matching easier, although I suspect it should not be necessary.

while (<$IN_FILE>) {
    s/^(#[^!]+$)//;            # remove whole line comments
    s{(^[^#]+?)(#[^/]+$)}{$1}; # remove inline comments
    s/('.*?'|".*?")//g;        # remove all single line strings
    push @words, split;        # split line into words
}

foreach my $item (<@words>) {
    push @numbers, $1 if $item =~ /([-+]?([0-9]+(\.[0-9]+)?|\.[0-9]+))/;
}

It worked OK, but failed to match an array index such as the 0 in $ARGV[0].

So I need some help improving my code. Ideally I would not have to remove comments and strings first or split lines into words, and of course it should match all the numbers that are not in comments or strings.

Sample input

# Comment 1
my $time = <STDIN>;
chomp $time;
   #now write input to STDOUT
print $time . "\n";
my $pi = 3.1415926;
my $test = -3.22;
my $t = +0.01;
my $range = (8..11);
if $ARGV[0];
sub sample2 {
   print "true or false";
   return 3 + 4 eq "7"; # true or false
}

Here is the output from my code. It missed the 0 in $ARGV[0] and the 11 in (8..11). I won't be surprised if it misses more.

[Numbers]
3.1415926
-3.22
+0.01
8
2
3
4

The main problem is here:

foreach my $item (<@words>) {

You want to iterate over @words, so the <> are not needed. Angle brackets around anything other than a filehandle or a simple scalar variable are treated as a glob, which changes the list you iterate over. Just insert

warn "\t$item\n"

into the last loop to see what's being processed.

Even after fixing this, (8..11) is still a single "word" after the split, and because you match without /g, you cannot get more than one number from an item.
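To illustrate the glob effect, here is a minimal sketch. The word list is hypothetical and merely stands in for what split produced in the question; the exact glob result depends on the files in the current directory.

```perl
use strict;
use warnings;

# Hypothetical word list, standing in for what split produced in the question.
my @words = ('$ARGV[0];', '(8..11);', '3.14;');

# Angle brackets around an array behave like glob on the joined words:
# each word becomes a filename pattern, so a word containing [...] only
# survives if a matching file happens to exist in the current directory.
my @globbed = glob("@words");

# Iterating over the array itself leaves every word untouched.
my @plain = @words;

printf "glob:  %d item(s): %s\n", scalar @globbed, "@globbed";
printf "plain: %d item(s): %s\n", scalar @plain,   "@plain";
```

In a directory with no file named `$ARGV0;`, the first word should drop out of the glob result, which would explain the missing 0.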

As choroba already pointed out, your use of <@words> is an obvious bug.

However, you should simplify things by not breaking your lines into words in the first place, and instead use /g to match every number on a line:

use strict;
use warnings;

my @numbers;
while (<DATA>) {
    s/^(#[^!]+$)//;            # remove whole line comments
    s{(^[^#]+?)(#[^/]+$)}{$1}; # remove inline comments
    s/('.*?'|".*?")//g;        # remove all single line strings

    while (/([-+]?([0-9]+(\.[0-9]+)?|\.[0-9]+))/g) {
        push @numbers, $1;
    }
}

print "@numbers";

__DATA__
# Comment 1
my $time = <STDIN>;
chomp $time;
   #now write input to STDOUT
print $time . "\n";
my $pi = 3.1415926;
my $test = -3.22;
my $t = +0.01;
my $range = (8..11);
if $ARGV[0];
sub sample2 {
   print "true or false";
   return 3 + 4 eq "7"; # true or false
}

This will end up pulling too many results, such as the 2 in sample2 and a spurious .11 from (8..11). One solution is to add word boundaries around the numbers in the regex:

while (/([-+]?\b([0-9]+(\.[0-9]+)?|\.[0-9]+))\b/g) {

Outputs:

3.1415926 -3.22 +0.01 8 11 0 3 4
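If you would rather not strip comments and strings in separate passes, one possible sketch is a single alternation that consumes strings and comments without capturing them and captures only numbers. The helper name numbers_in_line is hypothetical, and the number pattern is simplified (no leading-dot form). It will still be fooled by constructs such as $#array, $1, or 3-4, so treat it as an illustration rather than a robust parser.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answers): returns the numbers
# found on one line, skipping quoted strings and a trailing comment.
sub numbers_in_line {
    my ($line) = @_;
    my @found;
    while ($line =~ m{
          "(?:[^"\\]|\\.)*"                  # double-quoted string: consumed only
        | '(?:[^'\\]|\\.)*'                  # single-quoted string: consumed only
        | \#.*                               # comment to end of line: consumed only
        | ([-+]?\b[0-9]+(?:\.[0-9]+)?\b)     # a number: the only captured branch
    }xg) {
        # The capture is undefined when a string or comment branch matched.
        push @found, $1 if defined $1;
    }
    return @found;
}

my @sample = (
    'my $range = (8..11);',
    'if $ARGV[0];',
    'return 3 + 4 eq "7"; # true or false',
);
print join(' ', map { numbers_in_line($_) } @sample), "\n";
```

The idea is that whichever branch matches first at a given position wins, so text inside a string or comment is swallowed before the number branch ever sees it.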

The best way to accomplish this, though, is to use PPI, which parses Perl source into tokens. That is definitely outside the scope of what your professor is trying to teach you, but to demonstrate:

use strict;
use warnings;

use PPI;

my $src = do {local $/; <DATA>};

# Load a document
my $doc = PPI::Document->new( \$src );

# Find all the number tokens within the doc
# (find returns a false value when nothing matches, hence the fallback)
my $nums = $doc->find( 'PPI::Token::Number' ) || [];
for (@$nums) {
    print $_->content, "\n";
}

__DATA__
# Comment 1
my $time = <STDIN>;
chomp $time;
   #now write input to STDOUT
print $time . "\n";
my $pi = 3.1415926;
my $test = -3.22;
my $t = +0.01;
my $range = (8..11);
if $ARGV[0];
sub sample2 {
   print "true or false";
   return 3 + 4 eq "7"; # true or false
}

Outputs:

3.1415926
-3.22
0.01
8
11
0
3
4


 