Cast a string into a hash or array in perl

I am currently parsing a comma-separated string of 2-tuples into a hash of scalars. For example, given the input:

"ip=192.168.100.1,port=80,file=howdy.php",

I end up with a hash that looks like:

%hash =
{
    ip => 192.168.100.1,
    port => 80,
    file => howdy.php
 }

Code works fine and looks something like this:

my $paramList = $1;
my @paramTuples = split(/,/, $paramList);
my %hash;
foreach my $paramTuple (@paramTuples) {
    my($key, $val) = split(/=/, $paramTuple, 2);
    $hash{$key} = $val;
}

I'd like to expand the functionality from just taking scalars to also take arrays and hashes. So, another example input could be:

"ips=(192.168.100.1,192.168.100.2),port=80,file=howdy.php,hashthing={key1 => val1, key2 => val2}",

I end up with a hash that looks like:

%hash =
{
    ips => (192.168.100.1, 192.168.100.2), # <--- this is an array
    port => 80,
    file => howdy.php,
    hashthing => { key1 => val1, key2 => val2 } # <--- this is a hash
 }

I know I can parse the input string character by character. For each tuple I would do the following: If the first character is a ( then parse an array. Else, if the first character is a { then parse a hash. Else parse a scalar.

A co-worker of mine thought you could turn a string that looks like "(red,yellow,blue)" into an array, or "{c1 => red, c2 => yellow, c3 => blue}" into a hash, with some kind of cast function. If I went this route, I could separate my 2-tuples with a delimiter other than a comma, such as | .

Is this possible in perl?

I think the "cast" function you're referring to might be eval.

Using eval

use strict;
use warnings;
use Data::Dumper;

my $string = "{ a => 1, b => 2, c => 3}";
my $thing =  eval $string;
print "thing is a ", ref($thing),"\n";
print Dumper $thing;

Will print:

thing is a HASH
$VAR1 = {
            'a' => 1,
            'b' => 2,
            'c' => 3
          };

Or for arrays:

my $another_string = "[1, 2, 3 ]";
my  $another_thing = eval $another_string;
print "another_thing is ", ref ( $another_thing ), "\n";
print Dumper $another_thing;

another_thing is ARRAY
$VAR1 = [
            1,
            2,
            3
          ];

Note that eval needs the brackets Perl itself uses for each data type: {} for anonymous hashes and [] for anonymous arrays. Values such as IP addresses must also be quoted, because a bare 192.168.100.1 is parsed as a v-string rather than as text. And since eval runs whatever Perl code the string contains, only use it on input you trust. So, to take your example above:

my %hash4;
my $ip_string = "ips=['192.168.100.1','192.168.100.2']";
my ( $key, $value ) = split ( /=/, $ip_string, 2 );
$hash4{$key} = eval $value;

my $hashthing_string = "{ key1 => 'val1', key2 => 'val2' }";
$hash4{'hashthing'} = eval $hashthing_string;
print Dumper \%hash4;

Gives:

$VAR1 = {
      'hashthing' => {
                       'key2' => 'val2',
                       'key1' => 'val1'
                     },
      'ips' => [
                 '192.168.100.1',
                 '192.168.100.2'
               ]
    };

Using map to make an array into a hash

If you want to turn an array into a hash, the map function is for that.

my @array = ( "red", "yellow", "blue" );
my %hash = map { $_ => 1 } @array; 
print Dumper \%hash;

Using slices of hashes

You can also use a slice if you have known values and known keys:

my @keys = ( "c1", "c2", "c3" );
my %hash2;
@hash2{@keys} = @array;
print Dumper \%hash2;

JSON / XML

Or if you have control over the export mechanism, you may find exporting as JSON or XML format would be a good choice, as they're well defined standards for 'data as text'. (You could perhaps use Perl's Storable too, if you're just moving data between Perl processes).

Again, taking the %hash4 above (with the IP addresses quoted as strings):

use JSON; 
print encode_json(\%hash4);

Gives us:

{"hashthing":{"key2":"val2","key1":"val1"},"ips":["192.168.100.1","192.168.100.2"]}

Which you can also pretty-print:

use JSON; 
print to_json(\%hash4, { pretty => 1} );

To get:

{
   "hashthing" : {
      "key2" : "val2",
      "key1" : "val1"
   },
   "ips" : [
      "192.168.100.1",
      "192.168.100.2"
   ]
}

This can be read back in with a simple:

my $data_structure = decode_json ( $input_text ); 
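
To see the whole round trip in one place, here is a minimal sketch. It assumes JSON::PP, the pure-Perl module that ships with Perl 5.14 and later and offers the same encode_json / decode_json functions; the helper name round_trip is made up for the illustration.

```perl
use strict;
use warnings;
use JSON::PP;    # core since Perl 5.14; exports encode_json and decode_json

# Hypothetical helper: serialize a structure to JSON text and read it back.
sub round_trip {
    my ($data) = @_;
    my $text = encode_json($data);    # compact JSON text
    return decode_json($text);        # back to a Perl reference
}

my $copy = round_trip({
    ips  => [ '192.168.100.1', '192.168.100.2' ],
    port => 80,
    file => 'howdy.php',
});

print "first ip: $copy->{ips}[0]\n";
print "ips is a ", ref( $copy->{ips} ), " reference\n";
```

The nested array survives as an array reference, which is the behaviour the question was after, and no eval is involved.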

Style point

As a point of style - can I suggest that the way you've formatted your data structures isn't ideal. If you 'print' them with Dumper then that's a common format that most people will recognise. So your 'first hash' looks like:

Declared as (note the my prefix and the () for the declaration, as well as the quotes required under strict):

my %hash3 = (
    "ip" => "192.168.100.1",
    "port" => 80,
    "file" => "howdy.php"
);

Dumped as (brackets of {} because it's an anonymous hash, but still quoting strings):

$VAR1 = {
          'file' => 'howdy.php',
          'ip' => '192.168.100.1',
          'port' => 80
        };

That way you'll have a bit more joy with people being able to reconstruct and interpret your code.

Note too - that the dumper style format is also suitable (in specific limited cases) for re-reading via eval .
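
As a rough sketch of that round trip (the helper name dumper_round_trip is invented here, and this is only reasonable for text you produced yourself, since eval will run any Perl it is handed):

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: dump a structure to text, then rebuild it with eval.
sub dumper_round_trip {
    my ($data) = @_;
    my $text = Dumper($data);    # text of the form "$VAR1 = { ... };"
    my $VAR1;                    # declared so the eval'd assignment passes strict
    eval $text;
    die "could not re-read dump: $@" if $@;
    return $VAR1;
}

my $rebuilt = dumper_round_trip({ ip => '192.168.100.1', port => 80 });
print "$rebuilt->{ip}:$rebuilt->{port}\n";
```

Declaring a lexical $VAR1 before the eval is one way to keep strict happy; the dumped text assigns to it and the helper returns it.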

Try this but compound values will have to be parsed separately.

my $qr_key_1 = qr{
  (         # begin capture
    [^=]+   # equal sign is separator. NB: spaces captured too.
  )         # end capture
}msx;

my $qr_value_simple_1 = qr{
  (         # begin capture
    [^,]+   # comma is separator. NB: spaces captured too.
  )         # end capture
}msx;

my $qr_value_parenthesis_1 = qr{
  \(        # starts with parenthesis
  (         # begin capture
    [^)]+   # end with parenthesis NB: spaces captured too.
  )         # end capture
  \)        # end with parenthesis
}msx;

my $qr_value_brace_1 = qr{
  \{        # starts with brace
  (         # begin capture
    [^\}]+  # end with brace NB: spaces captured too.
  )         # end capture
  \}        # end with brace
}msx;

my $qr_value_3 = qr{
  (?:       # group alternative
    $qr_value_parenthesis_1
  |         # or other value
    $qr_value_brace_1
  |         # or other value
    $qr_value_simple_1
  )         # end group
}msx;

my $qr_end = qr{
  (?:       # begin group
    \,      # ends in comma
  |         # or
    \z      # end of string
  )         # end group
}msx;

my $qr_all_4 = qr{
  $qr_key_1     # capture a key
  \=            # separates key from value(s)
  $qr_value_3   # capture a value
  $qr_end       # end of key-value pair
}msx;



while( my $line = <DATA> ){
  print "\n\n$line";  # for demonstration; remove in real script
  chomp $line;

  while( $line =~ m{ \G $qr_all_4 }cgmsx ){
    my $key = $1;
    my $value = $2 // $3 // $4;  # defined-or (Perl 5.10+): a captured "0" is not skipped

    print "$key = $value\n";  # for demonstration; remove in real script
  }
}

__DATA__
ip=192.168.100.1,port=80,file=howdy.php
ips=(192.168.100.1,192.168.100.2),port=80,file=howdy.php,hashthing={key1 => val1, key2 => val2}

Addendum:

The reason why it is so difficult to expand the parser is, in one word, context. The first line of data, ip=192.168.100.1,port=80,file=howdy.php, is context free. That is, none of the symbols in it ever change their meaning. A context-free data format, in this sense, can be parsed with regular expressions alone.

Rule #1: If the symbols denoting the data structure never change, it is a context-free format and regular expressions can parse it.

The second line, ips=(192.168.100.1,192.168.100.2),port=80,file=howdy.php,hashthing={key1 => val1, key2 => val2} is a different issue. The meaning of the comma and equal sign changes.

Now, you're thinking the comma doesn't change; it still separates things, doesn't it? But it changes what it separates. That is why the second line is more difficult to parse. The second line has three contexts, in a tree:

main context
+--- list context
+--- hash context

The tokenizer must switch parsing sets as the data switches context. This requires a state machine.

Rule #2: If the contexts of the data format form a tree, then it requires a state machine and a different parser for each context. The state machine determines which parser is in use. Since every context except the root has only one parent, the state machine can switch back to the parent at the end of its current context.

And this is the last rule, for completeness' sake. It is not used in this problem.

Rule #3: If the contexts form a DAG (directed acyclic graph) or a recursive (aka cyclic) graph, then the state machine requires a stack so it will know which context to switch back to when it reaches the end of the current context.

Now, you may have noticed that there is no state machine in the above code. It's there, but it's hidden in the regular expressions. Hiding it has a cost: the list and hash contexts are not parsed. Only their strings are found. They have to be parsed separately.
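
One way to do that separate parsing is sketched below. It is not part of the code above: the names trim, parse_list, parse_hash and parse_line are made up for the illustration, and it assumes that list elements are separated by commas and hash pairs by "=>", as in the sample data. Each helper is the "parser for one context" from Rule #2; the main loop decides which one to call from the capture group that matched.

```perl
use strict;
use warnings;

# Hypothetical helper: strip leading and trailing whitespace.
sub trim {
    my ($s) = @_;
    $s =~ s/^\s+//;
    $s =~ s/\s+$//;
    return $s;
}

# List context: a comma separates elements.
sub parse_list {
    my ($text) = @_;
    return [ map { trim($_) } split /,/, $text ];
}

# Hash context: a comma separates pairs, "=>" separates key from value.
sub parse_hash {
    my ($text) = @_;
    my %h;
    for my $pair ( split /,/, $text ) {
        my ( $k, $v ) = split /=>/, $pair, 2;
        $v = '' unless defined $v;    # tolerate a key with no value
        $h{ trim($k) } = trim($v);
    }
    return \%h;
}

# Main context: key=value fields, where a value is (list), {hash} or a scalar.
sub parse_line {
    my ($line) = @_;
    my %result;
    while ( $line =~ m{ \G ([^=]+) = (?: \( ([^)]*) \) | \{ ([^\}]*) \} | ([^,]*) ) (?: , | \z ) }gcx ) {
        my ( $key, $list, $hash, $scalar ) = ( $1, $2, $3, $4 );
        $result{$key} = defined $list ? parse_list($list)
                      : defined $hash ? parse_hash($hash)
                      :                 $scalar;
    }
    return \%result;
}

my $parsed = parse_line(
    'ips=(192.168.100.1,192.168.100.2),port=80,file=howdy.php,hashthing={key1 => val1, key2 => val2}'
);
print "second ip: $parsed->{ips}[1]\n";
print "key2: $parsed->{hashthing}{key2}\n";
```

Because the contexts here only nest one level deep, no stack is needed; a format that allowed lists inside hashes would fall under Rule #3 instead.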

Explanation:

The above code uses the qr// operator to create the parsing regular expression. The qr// operator compiles a regular expression and returns a reference to it. This reference can be used in a match, substitute, or another qr// expression. Think of each qr// expression as a subroutine. Just like normal subroutines, qr// expressions can be used in other qr// expressions, building up complex regular expressions from simpler ones.
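
A small, separate illustration of that building-block idea follows; the pattern and sub names ($qr_octet, $qr_ip, $qr_pair, extract_ip) are invented for the example and do not appear in the code above.

```perl
use strict;
use warnings;

my $qr_octet = qr{ \d{1,3} }x;                          # one number of an address
my $qr_ip    = qr{ $qr_octet (?: \. $qr_octet ){3} }x;  # four of them joined by dots
my $qr_pair  = qr{ \A ip = ($qr_ip) \z }x;              # the smaller pattern reused in a larger one

# Hypothetical helper: return the address from "ip=..." or undef if it does not match.
sub extract_ip {
    my ($text) = @_;
    return $text =~ $qr_pair ? $1 : undef;
}

my $ip = extract_ip('ip=192.168.100.1');
print defined $ip ? "found $ip\n" : "no match\n";
```

Each qr// value is interpolated into the next, so the final pattern is assembled from pieces that can be read and tested on their own.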

The first expression, $qr_key_1 , captures the key name in the main context. Since the equal sign separates the key from the value, it captures all non-equal-sign characters. The "_1" on the end of the variable name is what I use to remind myself that one capture group is present.

The options on the end of the expression, /m, /s, and /x, are recommended in Perl Best Practices, but only the /x option has an effect here, because these patterns use neither . nor ^ and $. The /x option allows spaces and comments in the regular expression.

The next expression, $qr_value_simple_1 , captures simple values for the key.

The next one, $qr_value_parenthesis_1, handles the list context. This is possible only because a closing parenthesis has only one meaning: end of list context. But it also has a price: the list is not parsed; only its string is found.

And again for $qr_value_brace_1 : the closing brace has only one meaning. And the hash is also not parsed.

The $qr_value_3 expression combines the value REs into one. The $qr_value_simple_1 must be last, because it would also match the start of a parenthesized or braced value; the others can be in any order.

The $qr_end parses the end of a field in the main context. There is no number at its end because it does not capture anything.

And finally, $qr_all_4 puts them all together to create the RE for data.

The RE used in the inner loop, m{ \G $qr_all_4 }cgmsx, parses out each field in the main context. The \G assertion means: if $line has been changed since the last match (or has never been matched), start at the beginning of the string; otherwise, start where the last match finished. It is used together with the /c and /g options to pull the fields out of $line one at a time for processing inside the loop.
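
To show \G, /g and /c working together outside the larger script, here is a small tokenizer sketch. The sub name tokenize and the token labels WORD, EQUAL and COMMA are assumptions made for the example. With /c, a failed match leaves the position where it was, so the next pattern can be tried at the same spot.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: splits "a=1,b=2" into labelled tokens using \G and /gc.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;    # start at the beginning of the string
    while ( pos($text) < length $text ) {
        if    ( $text =~ m{ \G ([^=,]+) }gcx ) { push @tokens, [ WORD  => $1 ] }
        elsif ( $text =~ m{ \G = }gcx )        { push @tokens, [ EQUAL => '=' ] }
        elsif ( $text =~ m{ \G , }gcx )        { push @tokens, [ COMMA => ',' ] }
        else  { die "unexpected input at offset " . pos($text) }
    }
    return @tokens;
}

for my $token ( tokenize('ip=192.168.100.1,port=80') ) {
    print "$token->[0]: $token->[1]\n";
}
```

Without /c, the first failed alternative would reset the position to the start of the string and the loop would never advance past the first separator.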

And that is briefly what is happening inside the code.
