
How to split files up, process them in parallel, and then stitch them back together? (unix)

I have a text file infile.txt as such:

abc what's the foo bar.
foobar hello world, hhaha cluster spatio something something.
xyz trying to do this in parallel
kmeans you're mean, who's mean?

Each line in the file is processed by this Perl command into out.txt:

`cat infile.txt | perl dosomething > out.txt`

Imagine the text file has 100,000,000 lines. I want to parallelize the bash command, so I tried something like this:

$ mkdir splitfiles
$ mkdir splitfiles_processed
$ cd splitfiles
$ split -n3 ../infile.txt
$ for i in *; do cat "$i" | perl dosomething > "../splitfiles_processed/$i" & done
$ wait
$ cd ../splitfiles_processed
$ cat * > ../infile_processed.txt

But is there a less verbose way to do the same?

The answer from @Ulfalizer gives you a good hint about the solution, but it lacks details.

You can use GNU parallel (apt-get install parallel on Debian).

So your problem can be solved using the following command:

parallel -a infile.txt -l 1000 -j10 -k --spreadstdin perl dosomething > result.txt

Here is the meaning of the arguments:

-a: read input from the given file instead of stdin
-l 1000: send blocks of 1000 lines to the command
-j10: launch 10 jobs in parallel
-k: keep the output in the same order as the input
--spreadstdin: spread those 1000-line blocks over the stdin of the running commands
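
If you want to try the pipeline end to end before plugging in your real script, any line filter can stand in for dosomething. For instance (a hypothetical stand-in, not part of the original answer), a dosomething file containing:

#!/usr/bin/perl -p
# stand-in for the real processing: uppercase each line
$_ = uc;

run with the same invocation as above, produces result.txt with the lines uppercased and still in input order (because of -k).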

I've never tried it myself, but GNU parallel might be worth checking out.

Here's an excerpt from the man page (parallel(1)) that's similar to what you're currently doing. It can split the input in other ways too.

EXAMPLE: Processing a big file using more cores
       To process a big file or some output you can use --pipe to split up
       the data into blocks and pipe the blocks into the processing program.

       If the program is gzip -9 you can do:

       cat bigfile | parallel --pipe --recend '' -k gzip -9 >bigfile.gz

       This will split bigfile into blocks of 1 MB and pass that to gzip -9
       in parallel. One gzip will be run per CPU core. The output of gzip -9
       will be kept in order and saved to bigfile.gz
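
Adapted to your case, that would be something like the following (an untested sketch, using only options already shown in the excerpt; the default record separator of newline keeps lines intact):

cat infile.txt | parallel --pipe -k perl dosomething > out.txt

By default --pipe hands each job a block of roughly 1 MB, split on line boundaries, and -k keeps the output in the same order as the input.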

Whether this is worthwhile depends on how CPU-intensive your processing is. For simple scripts you'll spend most of the time shuffling data to and from the disk, and parallelizing won't get you much.

You can find some introductory videos by the GNU Parallel author here.

Assuming your limiting factor is NOT your disk, you can do this in Perl with fork(), and specifically Parallel::ForkManager:

#!/usr/bin/perl

use strict;
use warnings;

use Parallel::ForkManager;

my $max_forks = 8; # 2x the number of processors is usually optimal

sub process_line {
    #do something with this line
}

my $fork_manager = Parallel::ForkManager -> new ( $max_forks ); 

open ( my $input, '<', 'infile.txt' ) or die $!;
while ( my $line = <$input> ) {
    $fork_manager -> start and next;
    process_line ( $line );
    $fork_manager -> finish;
}

close ( $input );
$fork_manager -> wait_all_children();

The downside of doing something like this, though, is coalescing your output. Each parallel task doesn't necessarily finish in the sequence it started, so you have all sorts of potential problems with serialising the results.

You can work around these with something like flock but you need to be careful, as too many locking operations can take away your parallel advantage in the first place. (Hence my first statement - if your limiting factor is disk IO, then parallelism doesn't help very much at all anyway).
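
For illustration only, here is one hypothetical way to fill in the process_line stub above with that flock idea, appending each child's result to an assumed out.txt under an exclusive lock:

use Fcntl qw(:flock);

sub process_line {
    my ($line) = @_;
    my $result = uc $line;    # stand-in for the real processing
    open ( my $out, '>>', 'out.txt' ) or die $!;
    flock ( $out, LOCK_EX ) or die $!;
    print {$out} $result;
    close ( $out );    # closing the handle also releases the lock
}

The lock only prevents two children from interleaving a single write; it does nothing to restore the original line order, which is exactly the serialisation problem described above.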

There are various possible solutions though, so much so that there's a whole chapter on it in the Perl docs (perlipc), but keep in mind that you can retrieve data with Parallel::ForkManager too.
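
To sketch what retrieving data with Parallel::ForkManager can look like (an illustrative rewrite of the loop above, not the original answer's code; it assumes the results fit in memory), each child hands its result back through finish() and the parent reassembles the output in input order in a run_on_finish callback:

#!/usr/bin/perl

use strict;
use warnings;

use Parallel::ForkManager;

my $max_forks = 8;
my $fork_manager = Parallel::ForkManager -> new ( $max_forks );

my @results;    # indexed by input line number, so output order is preserved

# runs in the parent whenever a child exits; $data_ref is whatever the
# child passed to finish()
$fork_manager -> run_on_finish ( sub {
    my ( $pid, $exit_code, $line_number, $signal, $core, $data_ref ) = @_;
    $results[$line_number] = $$data_ref if defined $data_ref;
} );

open ( my $input, '<', 'infile.txt' ) or die $!;
while ( my $line = <$input> ) {
    my $line_number = $. - 1;
    $fork_manager -> start ( $line_number ) and next;    # parent: keep reading
    my $processed = process_line ( $line );              # child: do the work
    $fork_manager -> finish ( 0, \$processed );          # ship result to parent
}

close ( $input );
$fork_manager -> wait_all_children();

print grep { defined } @results;

sub process_line {
    my ($line) = @_;
    return uc $line;    # stand-in for the real processing
}

Forking once per line is expensive for 100,000,000 lines, so in practice you would hand each child a chunk of lines rather than a single one; the callback pattern stays the same.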
