
Perl distributed parallel computing

I would like to know if there are any Perl modules available that enable distributed parallel computation similar to Apache Hadoop.

For example, a Perl script that is executed on many machines in parallel when submitted from a client node.

I'm the author of the Many-Core Engine (MCE) for Perl.

During the next several weekends, I will take MCE for a spin with Gearman::XS. MCE is good at maximizing available cores on a given node. Gearman is good at job distribution and includes many features, such as load balancing. Combining the two was my thought for scaling MCE horizontally across many nodes. :) I did not share this news with anybody until just now.

Why the two modules are a good fit (my humble opinion):

  1. For distribution, one needs some sort of chunking engine. MCE is a chunking engine -- so breaking up input is natural to MCE. Essentially, MCE can be used on both sides: on the job-submission host as well as on the worker node, for maximizing available cores (see the sketch after this list).

  2. For worker nodes, MCE follows a bank-queuing model when processing input data. This helps guarantee that all CPUs remain busy from the start of the job until the very end. As workers begin to idle down, the remaining workers are processing their very last chunk.
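Below is a minimal sketch of that chunking on a single node, assuming a recent MCE release; the input list, worker count, and chunk size are all illustrative:

use strict;
use warnings;
use MCE;

# Illustrative input: 10,000 records to be chunked across workers.
my @records = (1 .. 10_000);

my $mce = MCE->new(
    max_workers => 8,       # workers on this node; 'auto' also works
    chunk_size  => 1000,    # records handed to a worker per request
    input_data  => \@records,
    user_func   => sub {
        my ($mce, $chunk_ref, $chunk_id) = @_;
        # Each worker grabs the next chunk when it finishes the current
        # one (bank queuing), so all cores stay busy until the end.
        my $sum = 0;
        $sum += $_ for @{ $chunk_ref };
        MCE->printf("chunk %d: sum = %d\n", $chunk_id, $sum);
    },
);

$mce->run;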

One's imagination is the limit here -- there are so many possibilities with these two modules working together. When writing MCE, I first focused on the node side. Job distribution is obviously next, and a search led me to Gearman::XS. The two modules can chunk happily together :) bigger chunks on the job-distribution side, smaller chunks once on the node. All the networking is handled by Gearman.

Basically, there's no need for me to write the job-distribution aspect when Gearman::XS is already quite nice. This has been my plan. I will write about Gearman::XS + MCE soon; a rough sketch of the combination follows.
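As a sketch only (not the write-up itself), a Gearman worker on each node could receive a big chunk and let MCE fan it out across the local cores. The function name 'process_chunk' and the newline-separated workload format are assumptions:

use strict;
use warnings;
use Gearman::XS qw(:constants);
use Gearman::XS::Worker;
use MCE;

my $worker = Gearman::XS::Worker->new;
$worker->add_server('127.0.0.1', 4730);    # gearmand host and port

# Gearman delivers the big chunk; MCE re-chunks it locally.
$worker->add_function('process_chunk', 0, sub {
    my ($job) = @_;
    my @records = split /\n/, $job->workload;

    MCE->new(
        max_workers => 'auto',   # use all cores on this node
        chunk_size  => 100,      # smaller chunks for local workers
        input_data  => \@records,
        user_func   => sub {
            my ($mce, $chunk_ref, $chunk_id) = @_;
            # ... per-record work goes here ...
        },
    )->run;

    return scalar @records;      # result travels back through Gearman
}, undef);

while (1) {
    last if $worker->work() != GEARMAN_SUCCESS;
}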

BTW: Folks can do similar things with GRID::Machine + MCE, I imagine. MCE's beauty is in maximizing all available cores on any given node.

Another magical thing about MCE: one may not want 200 nodes * 16 workers all reading from and writing to the NFS server, for example. That would impact the NFS server greatly. (BTW: RHEL 6.4 will include pNFS, parallel NFS.) With MCE, workers can call the "do" method to serialize reads and writes against NFS. So instead of 200 * 16 = 3200 workers attacking NFS, it becomes at most 200 requests against the NFS server at any given time (1 per physical node), as sketched below.
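Here is a minimal sketch of that idea, assuming a recent MCE release; the NFS paths and the callback name are illustrative:

use strict;
use warnings;
use MCE;

# Runs in the manager process only; all workers on this node funnel
# their writes through here, one at a time.
sub append_result {
    my ($line) = @_;
    open my $fh, '>>', '/nfs/results.txt' or die "open: $!";
    print {$fh} $line, "\n";
    close $fh;
}

MCE->new(
    max_workers => 16,
    chunk_size  => 500,
    input_data  => '/nfs/input.txt',   # MCE chunks the file itself
    user_func   => sub {
        my ($mce, $chunk_ref, $chunk_id) = @_;
        for my $line (@{ $chunk_ref }) {
            chomp $line;
            # ... compute something from $line ...
            MCE->do('append_result', $line);   # serialized via manager
        }
    },
)->run;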

MCE can be applied gracefully to many scenarios. I need to add more wiki pages to MCE's home page at code.google.com. In addition, MCE eats really big log files for breakfast :) Check out egrep.pl and wc.pl under the examples dir. It even beats the Wide Finder project with sequential IO (powerful slurp IO among many workers).

Check out the images included with the MCE distribution. Oh, and do not forget to check out the main Gearman site as well.

What's left after this? Hmm, the web piece. One idea that comes to mind is to use Mojolicious. There are many options; this is just one:

Gearman::XS + MCE + Mojolicious
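For instance, just as a sketch, a Mojolicious::Lite app could accept submissions over HTTP and hand them to Gearman as background jobs; the 'process_chunk' function and the local gearmand are assumptions:

use Mojolicious::Lite;
use Gearman::XS qw(:constants);
use Gearman::XS::Client;

my $gearman = Gearman::XS::Client->new;
$gearman->add_server('127.0.0.1', 4730);

# POST a payload; it is queued as a Gearman background job and
# eventually lands on a worker node running MCE.
post '/submit' => sub {
    my $c = shift;
    my ($ret, $job_handle) =
        $gearman->do_background('process_chunk', $c->req->body);
    return $c->render(text => "queued: $job_handle\n")
        if $ret == GEARMAN_SUCCESS;
    $c->render(text => "queue failed\n", status => 500);
};

app->start;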

Again, one can use GRID::Machine instead of Gearman::XS if you want to communicate over SSH.

Anyway, that was my plan: use an already available job-distribution module. For MCE, my focus was on maximizing performance on a single node -- including chunking, serialization, the bank-queuing model, user tasks (allowing many roles), number sequencing among workers, and sequential slurp IO.

-- mario

You might look into something as simple as a message queue like ZeroMQ. I'm sure a CPAN search could turn up some other suggestions.
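For a taste, here is a minimal push/pull pipeline using the ZMQ::FFI binding (one of several ZeroMQ bindings on CPAN); the endpoint, the fork standing in for a second machine, and the task format are all illustrative:

use strict;
use warnings;
use ZMQ::FFI;
use ZMQ::FFI::Constants qw(ZMQ_PUSH ZMQ_PULL);

# Two processes stand in for two machines: a ventilator that pushes
# tasks and a worker that pulls them.
if (my $pid = fork) {
    my $ctx  = ZMQ::FFI->new;
    my $push = $ctx->socket(ZMQ_PUSH);
    $push->bind('tcp://*:5557');
    sleep 1;                              # crude wait for the connect
    $push->send("task $_") for 1 .. 10;
    waitpid $pid, 0;
} else {
    my $ctx  = ZMQ::FFI->new;
    my $pull = $ctx->socket(ZMQ_PULL);
    $pull->connect('tcp://127.0.0.1:5557');
    print "worker got: ", $pull->recv, "\n" for 1 .. 10;
    exit;
}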

Recently there has been some talk of the Many-Core Engine (MCE) module, which you might want to investigate. I don't know for sure that it lets you parallelize off the host computer, but it seems like it wouldn't be a big step given its stated purpose.

The GRID::Machine module on CPAN is designed for distributed computing:

https://metacpan.org/pod/distribution/GRID-Machine/lib/GRID/Machine/perlparintro.pod
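As a quick taste (the host name is illustrative and SSH key access is assumed; the linked intro covers the full API), GRID::Machine lets you run Perl code on the remote side:

use strict;
use warnings;
use GRID::Machine;

# Start a remote Perl interpreter over SSH.
my $machine = GRID::Machine->new(host => 'user@remote.example.com');

# The string is compiled and run on the remote node; arguments
# arrive in @_ there.
my $r = $machine->eval(q{
    my $sum = 0;
    $sum += $_ for @_;
    return $sum;
}, 1 .. 100);

die $r->errmsg unless $r->ok;
print "remote sum = ", $r->result, "\n";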

Argon may provide what you are looking for (disclaimer: I'm the author). It allows you to set up an arbitrary network of workers, each of which runs a process pool (using Coro::ProcessPool).

Creating a task is pretty simple:

use strict;
use warnings;
use Argon::Client;

# Connect to the Argon manager node (host and port are illustrative).
my $client = Argon::Client->new(host => "somehost", port => 8000);

# Queue a code ref; it is executed in a worker's process pool and
# the result comes back to the client.
my $result = $client->queue(sub {
    use My::Work::Module qw(do_work);
    my $task_id = shift;
    do_work($task_id);
});
