
Cast a string into a hash or array in Perl

I am currently parsing a comma-separated string of 2-tuples into a hash of scalars. For example, given the input:

"ip=192.168.100.1,port=80,file=howdy.php",

I end up with a hash that looks like:

%hash =
{
    ip => 192.168.100.1,
    port => 80,
    file => howdy.php
 }

The code works fine and looks something like this:

my $paramList = $1;
my @paramTuples = split(/,/, $paramList);
my %hash;
foreach my $paramTuple (@paramTuples) {
    my($key, $val) = split(/=/, $paramTuple, 2);
    $hash{$key} = $val;
}

I'd like to expand the functionality from taking only scalars to also taking arrays and hashes. So, another example input could be:

"ips=(192.168.100.1,192.168.100.2),port=80,file=howdy.php,hashthing={key1 => val1, key2 => val2}",

and I would want to end up with a hash that looks like:

%hash =
{
    ips => (192.168.100.1, 192.168.100.2), # <--- this is an array
    port => 80,
    file => howdy.php,
    hashthing => { key1 => val1, key2 => val2 } # <--- this is a hash
 }

I know I can parse the input string character by character. For each tuple I would do the following: if the first character is a (, parse an array; else, if the first character is a {, parse a hash; else parse a scalar.

A co-worker of mine thought you could turn a string that looks like "(red,yellow,blue)" into an array, or "{c1 => red, c2 => yellow, c3 => blue}" into a hash, with some kind of cast function. If I went this route, I could separate my 2-tuples with a different delimiter instead of a comma, such as |.

Is this possible in Perl?

I think the "cast" function you're referring to might be eval.

Using eval

use strict;
use warnings;
use Data::Dumper;

my $string = "{ a => 1, b => 2, c => 3}";
my $thing =  eval $string;
print "thing is a ", ref($thing),"\n";
print Dumper $thing;

Will print:

thing is a HASH
$VAR1 = {
            'a' => 1,
            'b' => 2,
            'c' => 3
          };

Or for arrays:

my $another_string = "[1, 2, 3 ]";
my  $another_thing = eval $another_string;
print "another_thing is ", ref ( $another_thing ), "\n";
print Dumper $another_thing;

another_thing is ARRAY
$VAR1 = [
            1,
            2,
            3
          ];

Note, though, that eval requires you to use the brackets appropriate to the data type: {} for anonymous hashes and [] for anonymous arrays. So to take your example above:

my %hash4;
# The IPs are quoted: a bare 192.168.100.1 in Perl source is parsed as a v-string.
my $ip_string = "ips=['192.168.100.1','192.168.100.2']";
my ( $key, $value ) = split ( /=/, $ip_string, 2 );
$hash4{$key} = eval $value;

my $hashthing_string = "{ key1 => 'val1', key2 => 'val2' }";
$hash4{'hashthing'} = eval $hashthing_string;
print Dumper \%hash4;

Gives:

$VAR1 = {
      'hashthing' => {
                       'key2' => 'val2',
                       'key1' => 'val1'
                     },
      'ips' => [
                 '192.168.100.1',
                 '192.168.100.2'
               ]
    };

Using map to make an array into a hash

If you want to turn an array into a hash, the map function is the tool for that.

my @array = ( "red", "yellow", "blue" );
my %hash = map { $_ => 1 } @array; 
print Dumper \%hash;
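The same map idiom can also build the flat key/value hash from the question in one pass. This is only a sketch: the variable names are made up for illustration, and it assumes every chunk contains an = sign (a chunk without one would leave the list with an odd number of elements).

```perl
use strict;
use warnings;

my $param_list = 'ip=192.168.100.1,port=80,file=howdy.php';

# Each "key=value" chunk expands to a two-element list (key, value),
# and the flattened list is assigned to the hash.
my %params = map { split( /=/, $_, 2 ) } split( /,/, $param_list );

print "$_ => $params{$_}\n" for sort keys %params;
```

The limit of 2 on the inner split keeps any further = signs inside the value, just as in the original loop.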

Using slices of hashes

You can also use a slice if you have known values and known keys:

my @keys = ( "c1", "c2", "c3" );
my %hash2;
@hash2{@keys} = @array;
print Dumper \%hash2;

JSON / XML

Or, if you have control over the export mechanism, you may find that exporting as JSON or XML is a good choice, as they're well-defined standards for 'data as text'. (You could perhaps use Perl's Storable too, if you're just moving data between Perl processes.)

Again, to take the %hash4 above (with the IPs quoted as strings):

use JSON; 
print encode_json(\%hash4);

Gives us:

{"hashthing":{"key2":"val2","key1":"val1"},"ips":["192.168.100.1","192.168.100.2"]}

Which you can also pretty-print:

use JSON; 
print to_json(\%hash4, { pretty => 1} );

To get:

{
   "hashthing" : {
      "key2" : "val2",
      "key1" : "val1"
   },
   "ips" : [
      "192.168.100.1",
      "192.168.100.2"
   ]
}

This can be read back in with a simple:

my $data_structure = decode_json ( $input_text ); 
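Putting the encode and decode steps together, a round trip might look like the sketch below. It assumes the core JSON::PP module as a stand-in for JSON (both export encode_json and decode_json), and %config is a made-up example structure.

```perl
use strict;
use warnings;
use JSON::PP;   # core module, assumed here in place of JSON

my %config = (
    ips       => [ '192.168.100.1', '192.168.100.2' ],
    port      => 80,
    hashthing => { key1 => 'val1', key2 => 'val2' },
);

# canonical() sorts the keys so the text is stable between runs
my $json_text = JSON::PP->new->canonical->encode( \%config );

# decode_json returns a reference to a fresh copy of the structure
my $restored = decode_json($json_text);

print "$json_text\n";
print "second ip: $restored->{ips}[1]\n";
```

Unlike eval, decoding JSON never executes the input as code, which matters if the string comes from outside your program.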

Style point

As a point of style, can I suggest that the way you've formatted your data structures isn't ideal? If you 'print' them with Dumper, you get a common format that most people will recognise. So your 'first hash' looks like:

Declared as (note the my prefix, the () for the declaration, and the quotes required under strict):

my %hash3 = (
    "ip" => "192.168.100.1",
    "port" => 80,
    "file" => "howdy.php"
);

Dumped as (brackets of {} because it's an anonymous hash, but still quoting strings):

$VAR1 = {
          'file' => 'howdy.php',
          'ip' => '192.168.100.1',
          'port' => 80
        };

That way you'll have a bit more joy with people being able to reconstruct and interpret your code.

Note too that the Dumper-style format is also suitable (in specific, limited cases) for re-reading via eval.
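As a rough illustration of that last point, the sketch below dumps a hash and reads the text back with eval. The variable names are made up, and it assumes the dump was produced by your own program, since eval runs whatever Perl code the string contains.

```perl
use strict;
use warnings;
use Data::Dumper;

my %original = ( ip => '192.168.100.1', port => 80, file => 'howdy.php' );

# Dumper produces Perl source text of the form "$VAR1 = { ... };"
my $dumped = Dumper( \%original );

# Declare $VAR1 so the dumped assignment compiles under strict.
my $VAR1;
my $copy = eval $dumped;   # value of the assignment: the new hash ref
die "could not re-read dump: $@" if $@;

print "file is $copy->{file}\n";
```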

Try this, but compound values will have to be parsed separately.

my $qr_key_1 = qr{
  (         # begin capture
    [^=]+   # equal sign is separator. NB: spaces captured too.
  )         # end capture
}msx;

my $qr_value_simple_1 = qr{
  (         # begin capture
    [^,]+   # comma is separator. NB: spaces captured too.
  )         # end capture
}msx;

my $qr_value_parenthesis_1 = qr{
  \(        # starts with parenthesis
  (         # begin capture
    [^)]+   # end with parenthesis NB: spaces captured too.
  )         # end capture
  \)        # end with parenthesis
}msx;

my $qr_value_brace_1 = qr{
  \{        # starts with brace
  (         # begin capture
    [^\}]+  # end with brace NB: spaces captured too.
  )         # end capture
  \}        # end with brace
}msx;

my $qr_value_3 = qr{
  (?:       # group alternative
    $qr_value_parenthesis_1
  |         # or other value
    $qr_value_brace_1
  |         # or other value
    $qr_value_simple_1
  )         # end group
}msx;

my $qr_end = qr{
  (?:       # begin group
    \,      # ends in comma
  |         # or
    \z      # end of string
  )         # end group
}msx;

my $qr_all_4 = qr{
  $qr_key_1     # capture a key
  \=            # separates key from value(s)
  $qr_value_3   # capture a value
  $qr_end       # end of key-value pair
}msx;



while( my $line = <DATA> ){
  print "\n\n$line";  # for demonstration; remove in real script
  chomp $line;

  while( $line =~ m{ \G $qr_all_4 }cgmsx ){
    my $key = $1;
    my $value = $2 // $3 // $4;  # first defined capture; // keeps a captured "0"

    print "$key = $value\n";  # for demonstration; remove in real script
  }
}

__DATA__
ip=192.168.100.1,port=80,file=howdy.php
ips=(192.168.100.1,192.168.100.2),port=80,file=howdy.php,hashthing={key1 => val1, key2 => val2}

Addendum:

The reason why it is so difficult to expand the parse is, in one word, context. The first line of data, ip=192.168.100.1,port=80,file=howdy.php, is context free. That is, the symbols in it never change their meaning. A context-free data format can be parsed with regular expressions alone.

Rule #1: If the symbols denoting the data structure never change, it is a context-free format and regular expressions can parse it.

The second line, ips=(192.168.100.1,192.168.100.2),port=80,file=howdy.php,hashthing={key1 => val1, key2 => val2}, is a different issue. The meaning of the comma and the equal sign changes.

Now, you may be thinking that the comma doesn't change; it still separates things, doesn't it? But what it separates changes. That is why the second line is more difficult to parse. The second line has three contexts, arranged in a tree:

main context
+--- list context
+--- hash context

The tokenizer must switch parsing sets as the data switches context. This requires a state machine.

Rule #2: If the contexts of the data format form a tree, then it requires a state machine and a different parser for each context. The state machine determines which parser is in use. Since every context except the root has only one parent, the state machine can switch back to the parent at the end of its current context.

And this is the last rule, for completeness' sake. It is not used in this problem.

Rule #3: If the contexts form a DAG (directed acyclic graph) or a recursive (aka cyclic) graph, then the state machine requires a stack so it will know which context to switch back to when it reaches the end of the current context.

Now, you may have noticed that there is no state machine in the regex code above. It's there, but it's hidden in the regular expressions. Hiding it has a cost: the list and hash contexts are not parsed. Only their strings are found, and they have to be parsed separately.

Explanation:

The regex code above uses the qr// operator to build the parsing regular expression. The qr// operator compiles a regular expression and returns a reference to it. This reference can be used in a match, a substitution, or another qr// expression. Think of each qr// expression as a subroutine. Just like normal subroutines, qr// expressions can be used in other qr// expressions, building up complex regular expressions from simpler ones.
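To see the building-block idea in isolation, here is a small sketch. The three patterns are hypothetical and are not the ones used in the answer's code; they only show one qr// expression being interpolated into the next.

```perl
use strict;
use warnings;

# Hypothetical building blocks, each one reusing the previous.
my $qr_octet = qr{ \d{1,3} }x;
my $qr_ipv4  = qr{ $qr_octet (?: \. $qr_octet ){3} }x;
my $qr_pair  = qr{ \A (\w+) = ($qr_ipv4) \z }x;

# In list context a successful match returns the captured groups.
my ( $name, $address ) = 'ip=192.168.100.1' =~ $qr_pair;

print "$name => $address\n";
```

Each interpolated qr// keeps its own flags and is wrapped in its own group, so a quantifier placed after it applies to the whole sub-pattern.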

The first expression, $qr_key_1, captures the key name in the main context. Since the equal sign separates the key from the value, it captures all non-equal-sign characters. The "_1" on the end of the variable name is what I use to remind myself that one capture group is present.

The options at the end of each expression, /m, /s, and /x, are recommended in Perl Best Practices, but only /x has an effect here: it allows whitespace and comments in the regular expression.

The next expression, $qr_value_simple_1, captures simple values for the key.

The next one, $qr_value_parenthesis_1, handles the list context. This is possible only because a closing parenthesis has only one meaning: end of list context. But it also has a price: the list is not parsed; only its string is found.

And again for $qr_value_brace_1: the closing brace has only one meaning, and the hash is also not parsed.
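One possible way to finish the job is to hand the captured strings to small per-context parsers. The two helpers below are hypothetical (they are not part of the answer's code) and assume that list items and hash values contain no commas or nested brackets of their own.

```perl
use strict;
use warnings;

# Hypothetical helper for the text captured between ( and ).
sub parse_list_body {
    my ($body) = @_;
    $body =~ s/^\s+|\s+$//g;                  # trim outer whitespace
    return [ split /\s*,\s*/, $body ];
}

# Hypothetical helper for the text captured between { and }.
sub parse_hash_body {
    my ($body) = @_;
    $body =~ s/^\s+|\s+$//g;
    my %pairs;
    for my $pair ( split /\s*,\s*/, $body ) {
        my ( $key, $value ) = split /\s*=>\s*/, $pair, 2;
        $pairs{$key} = $value;
    }
    return \%pairs;
}

my $ips       = parse_list_body('192.168.100.1,192.168.100.2');
my $hashthing = parse_hash_body('key1 => val1, key2 => val2');

print "second ip: $ips->[1]\n";
print "key2: $hashthing->{key2}\n";
```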

The $qr_value_3 expression combines the value REs into one. The $qr_value_simple_1 must be last, but the others can be in any order.

The $qr_end parses the end of a field in the main context. There is no number at its end because it does not capture anything.

And finally, $qr_all_4 puts them all together to create the RE for the data.

The RE used in the inner loop, m{ \G $qr_all_4 }cgmsx, parses out each field in the main context. The \G assertion means: if the string has been changed since the last match (or has never been matched), start the match at the beginning of the string; otherwise, start where the last match finished. This is used in conjunction with the /c and /g options to parse each field out of $line, one at a time, for processing inside the loop.
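A stripped-down sketch of the same \G loop, using a simplified pattern that only understands scalar values (an assumption made for brevity), shows how the match position advances and what /c adds:

```perl
use strict;
use warnings;

my $line = 'port=80,file=howdy.php';
my @pairs;

# Each pass resumes at pos($line), where the previous match ended.
while ( $line =~ m{ \G ([^=,]+) = ([^,]*) (?: , | \z ) }gcx ) {
    push @pairs, [ $1, $2 ];
}

# With /c a failed match leaves pos() alone, so leftovers can be reported.
my $stopped_at = pos($line) // 0;
warn "unparsed text at offset $stopped_at\n" if $stopped_at < length $line;

print "$_->[0] = $_->[1]\n" for @pairs;
```

Without /c the final failed match would reset pos($line), and there would be no way to tell where parsing stopped.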

And that is briefly what is happening inside the code.
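Should the format ever allow contexts to nest, the stack from Rule #3 could be made explicit. The sketch below is one way to do it under stated assumptions: split_top_level is a hypothetical helper, and it only splits the main context on commas, leaving each field's contents for a later step.

```perl
use strict;
use warnings;

# Hypothetical helper: split on commas that are not inside (...) or {...}.
sub split_top_level {
    my ($text) = @_;
    my %closer_for = ( '(' => ')', '{' => '}' );
    my @stack;            # closing brackets we are still waiting for
    my @fields = ('');

    for my $char ( split //, $text ) {
        if ( exists $closer_for{$char} ) {
            push @stack, $closer_for{$char};    # enter a nested context
        }
        elsif ( @stack && $char eq $stack[-1] ) {
            pop @stack;                         # return to the parent context
        }
        elsif ( $char eq ',' && !@stack ) {
            push @fields, '';                   # separator in the main context
            next;
        }
        $fields[-1] .= $char;
    }
    die "unbalanced brackets in '$text'\n" if @stack;
    return @fields;
}

my @fields = split_top_level(
    'ips=(192.168.100.1,192.168.100.2),port=80,hashthing={key1 => val1, key2 => val2}'
);
print "$_\n" for @fields;
```

The top of the stack plays the role of the current state: a comma is a field separator only when the stack is empty.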
