How to split files up, process them in parallel, and then stitch them back together? (unix)
I have a text file infile.txt as such:
abc what's the foo bar.
foobar hello world, hhaha cluster spatio something something.
xyz trying to do this in parallel
kmeans you're mean, who's mean?
Each line in the file will be processed by this perl command into out.txt:
`cat infile.txt | perl dosomething > out.txt`
Imagine the text file has 100,000,000 lines. I want to parallelize the bash command, so I tried something like this:
$ mkdir splitfiles
$ mkdir splitfiles_processed
$ cd splitfiles
$ split -n l/3 ../infile.txt
$ for i in *; do cat "$i" | perl dosomething > "../splitfiles_processed/$i" & done
$ wait
$ cd ../splitfiles_processed
$ cat * > ../infile_processed.txt
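Cleaned up and made self-contained, the steps above look like this (a minimal sketch: `tr a-z A-Z` stands in for `perl dosomething`, and `split -n l/3` is used so chunks break on line boundaries rather than mid-line):

```shell
#!/bin/sh
# Split -> process in parallel -> stitch, with only standard tools.
# "tr a-z A-Z" is a stand-in for "perl dosomething".
set -e
mkdir -p splitfiles splitfiles_processed
printf 'abc\nfoobar\nxyz\nkmeans\n' > infile.txt
( cd splitfiles && split -n l/3 ../infile.txt )   # l/3: split on line boundaries
for f in splitfiles/*; do
    tr 'a-z' 'A-Z' < "$f" > "splitfiles_processed/${f##*/}" &
done
wait
# split names chunks in sorted order (xaa, xab, ...), so cat * keeps order
cat splitfiles_processed/* > infile_processed.txt
```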
But is there a less verbose way to do the same?
The answer from @Ulfalizer gives you a good hint about the solution, but it lacks details.
You can use GNU parallel (`apt-get install parallel` on Debian).
So your problem can be solved using the following command:
parallel -a infile.txt -l 1000 -j10 -k --spreadstdin perl dosomething > result.txt
Here is the meaning of the arguments:
-a: read input from the given file instead of stdin
-l 1000: send blocks of 1000 lines to each command
-j10: run 10 jobs in parallel
-k: keep the output in input-sequence order
--spreadstdin: spread the 1000-line blocks across the stdin of the running jobs
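For comparison, here is a hedged sketch of the same fan-out done with `xargs -P`, which is usually preinstalled. Unlike `parallel -k`, xargs does not guarantee output order, so each chunk writes to its own file and the files are concatenated in name order afterwards (`tr 0-9 a-j` stands in for `perl dosomething`):

```shell
#!/bin/sh
# Fan out line-aligned chunks to 4 parallel workers with xargs -P.
# "tr 0-9 a-j" is a stand-in for "perl dosomething".
set -e
seq 10 17 > infile.txt                 # toy 8-line input
split -n l/4 infile.txt chunk.         # 4 line-aligned chunks: chunk.aa ...
printf '%s\n' chunk.?? | xargs -P 4 -I{} sh -c 'tr 0-9 a-j < "{}" > "{}.out"'
cat chunk.??.out > result.txt          # name order restores input order
```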
I've never tried it myself, but GNU parallel might be worth checking out.
Here's an excerpt from the man page (parallel(1)) that's similar to what you're currently doing. It can split the input in other ways too.
EXAMPLE: Processing a big file using more cores

To process a big file or some output you can use --pipe to split up the data into blocks and pipe the blocks into the processing program. If the program is gzip -9 you can do:

cat bigfile | parallel --pipe --recend '' -k gzip -9 > bigfile.gz

This will split bigfile into blocks of 1 MB and pass that to gzip -9 in parallel. One gzip will be run per CPU core. The output of gzip -9 will be kept in order and saved to bigfile.gz
Whether this is worthwhile depends on how CPU-intensive your processing is. For simple scripts you'll spend most of the time shuffling data to and from the disk, and parallelizing won't get you much.
You can find some introductory videos by the GNU Parallel author here.
Assuming your limiting factor is NOT your disk, you can do this in perl with fork(), and specifically Parallel::ForkManager:
#!/usr/bin/perl
use strict;
use warnings;

use Parallel::ForkManager;

my $max_forks = 8; # 2x the number of processors is usually optimal

sub process_line {
    # do something with this line
}

my $fork_manager = Parallel::ForkManager->new($max_forks);

open( my $input, '<', 'infile.txt' ) or die $!;
while ( my $line = <$input> ) {
    $fork_manager->start and next;
    process_line($line);
    $fork_manager->finish;
}
close($input);
$fork_manager->wait_all_children();
The downside of doing something like this, though, is coalescing your output. Each parallel task doesn't necessarily finish in the sequence it started, so you have all sorts of potential problems with serialising the results.
You can work around these with something like flock, but you need to be careful: too many locking operations can take away your parallel advantage in the first place. (Hence my first statement: if your limiting factor is disk IO, then parallelism doesn't help very much at all anyway.)
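The flock approach can be sketched in shell (a minimal sketch; flock(1) is the util-linux command and is assumed to be available):

```shell
#!/bin/sh
# Serialise appends from parallel jobs with flock(1): each job takes an
# exclusive lock on fd 9 (opened on the shared output file) before
# writing, so one job's output is never interleaved with another's.
: > out.txt
for i in 1 2 3 4; do
    (
        flock -x 9                             # block until we hold the lock
        printf 'result from job %s\n' "$i" >&9 # append while locked
    ) 9>>out.txt &
done
wait
```

Note the jobs still finish in an arbitrary order; the lock only guarantees each write lands whole.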
There are various possible solutions though, so much so that there's a whole chapter on it in the perl docs: perlipc. But keep in mind you can retrieve data with Parallel::ForkManager too.