
Perl: store many files content with Parallel::ForkManager

I have three files (each with two tab-separated fields, with no duplicates across the files). I want to read them in parallel and store their content in one single hash.

This is what I tried:

use warnings;
use strict;
use Parallel::ForkManager;
use Data::Dumper;

my @files = ('aa', 'ab', 'ac');

my %content;
my $max_processors = 3;
my $pm = Parallel::ForkManager->new($max_processors);

foreach my $file (@files) {
    $pm->start and next;

    open FH, $file or die $!;
    while(<FH>){
        chomp;
        my($field1, $field2) = split/\t/,$_;
        $content{$field1} = $field2;
    }
    close FH;

    $pm->finish;
}
$pm->wait_all_children;

print Dumper \%content;

The output of this script is

$VAR1 = {};

I can see that the three files are processed in parallel, but... how can I store the content of the three files for post-fork processing?

When you fork, the child process has its own separate memory, so the parent won't have access to the data you've read in. You'd have to find a way for the child to communicate the data back, maybe via pipes, but at that point you might as well not bother with forking and just read the data in directly.
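For illustration only (this is my sketch, not part of either answer's final solution), passing the data back through a pipe might look something like the following, using perl's forking open('-|'). Note the parent still ends up reading each child's output in turn, which is exactly why it may not be worth the bother:

use strict;
use warnings;

my @files = ('aa', 'ab', 'ac');
my %content;

for my $file (@files) {
    # open '-|' with no command forks; the child's STDOUT becomes $pipe in the parent
    my $pid = open(my $pipe, '-|') // die "Cannot fork: $!";
    if ($pid) {
        # Parent: read the lines the child sends back and build the hash
        while (<$pipe>) {
            chomp;
            my ($field1, $field2) = split /\t/;
            $content{$field1} = $field2;
        }
        close $pipe;
    }
    else {
        # Child: copy the file to STDOUT (i.e. down the pipe) and exit
        open my $fh, '<', $file or die "Cannot open $file: $!";
        print while <$fh>;
        exit 0;
    }
}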

What you probably want to look into is using threads, as they can share the same memory.
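A minimal sketch of the threads route (my illustration, assuming a perl built with thread support; rather than shared variables, it simply hands each file's key/value pairs back through join()):

use strict;
use warnings;
use threads;

my @files = ('aa', 'ab', 'ac');

sub read_file {
    my $file = shift;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my %content;
    while (<$fh>) {
        chomp;
        my ($field1, $field2) = split /\t/;
        $content{$field1} = $field2;
    }
    close $fh;
    return %content;    # returned as a flat list of key/value pairs
}

# One thread per file; join() hands each thread's pairs back to the main
# thread, where they are merged into a single hash.
my @threads = map { threads->create(\&read_file, $_) } @files;
my %content = map { $_->join } @threads;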

You can do it with a run_on_finish() callback, with the data returned as a reference and stored under something like the filename as a key (see the "Data structure retrieval" section of the Parallel::ForkManager docs for an example).

So, if you make your file reading code a subroutine, have it return the data as a reference, and then use a callback, you might end up with something like this:

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

use Parallel::ForkManager;
use Data::Dump;

sub proc_file {
    # Read the file and split into a hash; assuming the data struct, based on
    # OP's example.
    my $file = shift;
    open(my $fh, "<", $$file);
    my %content = map{ chomp; split(/\t/) }<$fh>;
    return \%content;
}

my %content;
my @files = ('aa','ab','ac');

my $pm = Parallel::ForkManager->new(3);
$pm->run_on_finish(
    sub {
        my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data_structure_reference) = @_;
        my $input_file = $data_structure_reference->{input};
        $content{$input_file} = $data_structure_reference->{result};
    }
);

# For each file, fork a child, and on finish create an object ref to the file
# and the results of processing, that can be stored in the $data_structure_reference.
for my $input_file (@files) {
    $pm->start and next;
    my $return_data = proc_file(\$input_file);

    $pm->finish(0,
        {
          result  => $return_data,
          input   => $input_file,
        }
     );
}
$pm->wait_all_children;

dd \%content;

That will result in a hash of hashes, with the filenames as keys and the contents as sub-hashes, which you can easily collapse or pool together or whatever you like (see the sketch after the output below):

$ ./parallel.pl a*
{
  aa => { apple => "pear" },
  ab => { Joe => "Wilson" },
  ac => { "New York" => "Mets" },
}
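Since the question says there are no redundancies among the files, pooling that back into the single hash the original code was aiming for could be as simple as this (my sketch; %combined is an illustrative name):

my %combined = map { %{ $content{$_} } } keys %content;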

Note that, like any forking approach, there's quite a bit of overhead involved, and this may not end up being any faster than simply looping through the files sequentially.
