简体   繁体   中英

Best way to parse string in perl

To achieve below task I have written below C like perl program (As I am new to Perl), But I am not sure if this is the best way to achieve.

Can someone please guide? Note: Not with the full program, But where I can make improvement.

Thanks in advance

Input :

$str = "mail1, local<mail1@mail.local>, mail2@mail.local, <mail3@mail.local>, mail4 local<mail4@mail.local>"

Expected Output :

mail1, local<mail1@mail.local>
mail2@mail.local
<mail3@mail.local>
mail4, local<mail4@mail.local>

Sample Program

my $str="mail1, \@local<mail1\@mail.local>, mail2\@mail.local, <mail3\@mail.local>, mail4, local<mail4\@mail.local>";
my $count=0, @array, $flag=0, $tempStr="";
for my $c (split (//,$str)) {
    if( ($count eq 0) and ($c eq ' ') ) {
        next;
    }
    if($c) {
        if( ($c eq ',') and ($flag eq 1) ) {
            push @array, $tempStr;
            $count=0;
            $flag1=0;
            $tempStr="";
            next;
        }
        if( ($c eq '>' ) or ( $c eq '@' ) ) {
            $flag=1;
        }
        $tempStr="$tempStr$c";
        $count++;
    }
}
if($count>0) {
    push @array, $tempStr;
}
foreach my $var (@array) {
    print "$var\n";
}

Edit:

Input:

Input is the output of above code.

Expected Output :

"mail1, local"<mail1@mail.local>
"mail4, local"<mail4@mail.local>

Sample Code:

$str =~ s/([^@>]+[@>][^,]+),\s*/$1\n/g;
my @addresses = split('\n',$str);
if(scalar @addresses) {
    foreach my $address (@addresses) {
        if (($address =~ /</) and ($address !~ /\"/) and ($address !~ /^</)){
            $address="\"$address";
            $address=~ s/</\"</g;
        }
    }
    $str = join(',',@addresses);
}
print "$str\n";

As I see, you want to replace each:

  • comma and following spaces,
  • occurring after either @ or > ,

with a newline.

To make such replacement, instead of writing a parsing program, you can use a regex.

The search part can be as follows:

([^@>]+[@>][^,]+),\s*

Details:

  • ( - Start of the 1st capturing group.
    • [^@>]+ - A non-empty sequence of chars other than @ or > .
    • [@>] - Either @ or > .
    • [^,]+ - A non-empty sequence of chars other than a comma.
  • ) - End of the 1st capturing group.
  • ,\\s* - A comma and optional sequence of spaces.

The replace part should be:

  • $1 - The 1st capturing group.
  • \\n - A newline.

So the whole program, much shorter than yours, can be as follows:

my $str='mail1, local<mail1@mail.local>, mail2@mail.local, <mail3@mail.local>, mail4, local<mail4@mail.local>';
print "Before:\n$str\n";
$str =~ s/([^@>]+[@>][^,]+),\s*/$1\n/g;
print "After:\n$str\n";

To replace all needed commas I used g option.

Note that I put the source string in single quotes, otherwise Perl would have complained about Possible unintended interpolation of @mail .

Edit

Your modified requirements must be handled different way. "Ordinary" replacement is not an option, because now there are some fragments to match and some framents to ignore .

So the basic idea is to write a while loop with a matching regex: (\\w+),?\\s+(\\w+)(<[^>]+>) , meaning:

  • (\\w+) - First capturing group - a sequence of word chars (eg mail1 ).
  • ,?\\s+ - Optional comma and a sequence of spaces.
  • (\\w+) - Second capturing group - a sequence of word chars (eg local ).
  • (<[^>]+>) - Third capturing group - a sequence of chars other than > (actual mail address), enclosed in angle brackets, eg <mail1@mail.local> .

Within each execution of the loop you have access to the groups captured in this particular match ( $1 , $2 , ...).

So the content of this loop is to print all these captured groups, with required additional chars.

The code (again much shorter than yours) should look like below:

my $str = 'mail1, local<mail1@mail.local>, mail2@mail.local, <mail3@mail.local>, mail4 local<mail4@mail.local>';
while ($str =~ /(\w+),?\s+(\w+)(<[^>]+>)/g) {
  print "\"$1, $2\"$3\n";
}

Here is an approach using split , which in this case also needs a careful regex

use warnings;
use strict;
use feature 'say';

my $string =   # broken into two parts for readabililty
    q(mail1, local<mail1@mail.local>, mail2@mail.local, )
 .  q(<mail3@mail.local>, mail4, local<mail4@mail.local>);

my @addresses = split /@.+?\K,\s*/, $string;

say for @addresses;

The split takes a full regex in its delimiter specification. In this case I figure that each record is delimited by a comma which comes after the email address, so @.+?,

To match a pattern only when it is preceded by another brings to mind a negative lookbehind before the comma. But those can't be of variable length, which is precisely the case here.

We can instead normally match the pattern @.+? and then use the \\K form (of the lookbehind) which drops all previous matches so that they are not taken out of the string. Thus the above splits on ,\\s* when that is preceded by the email address, @... (what isn't consumed).

It prints

mail1, local<mail1@mail.local>
mail2@mail.local
<mail3@mail.local>
mail4, local<mail4@mail.local>

The edit asks about quoting the description preceding <...> when it's there. A simple way is to make another pass once addresses have been parsed out of the string as above. For example

my @addresses = split /@.+?\K,\s*/, $string;   #/ stop syntax highlight

s/(.+?,\s*.+?)</"$1"</  for @addresses;

say for @addresses;

The regex in a loop is one way to change elements of an array. I use it for its efficiency (changes elements in place), conciseness, and as a demonstration of the following properties.

In a foreach loop the index variable (or $_ ) is an alias for the currently processed element – so changing it changes that element. This is a known source of bugs when allowed unknowingly, which was another reason to show it in the above form.

The statement also uses the statement modifier and it is equivalent to

foreach my $elem (@addresses) {
    $elem =~ s/(.+?,\s*.+?)</"$1"</;
}

This is often considered a more proper way to write it but I find that the other form emphasizes more clearly that elements are being changed, when that is the sole purpose of the foreach .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM