
How to split files up and process them in parallel and then stitch them back? unix

I have a text file infile.txt as such:

abc what's the foo bar.
foobar hello world, hhaha cluster spatio something something.
xyz trying to do this in parallel
kmeans you're mean, who's mean?

Each line in the file will be processed by this perl command into out.txt:

`cat infile.txt | perl dosomething > out.txt`

Imagine the text file is 100,000,000 lines. I want to parallelize the bash command, so I tried something like this:

$ mkdir splitfiles
$ mkdir splitfiles_processed
$ cd splitfiles
$ split -n l/3 ../infile.txt
$ for i in *; do cat "$i" | perl dosomething > "../splitfiles_processed/$i" & done
$ wait
$ cd ../splitfiles_processed
$ cat * > ../infile_processed.txt

But is there a less verbose way to do the same?

The answer from @Ulfalizer gives you a good hint about the solution, but it lacks details.

You can use GNU parallel (apt-get install parallel on Debian).

So your problem can be solved using the following command:

parallel -a infile.txt -l 1000 -j10 -k --spreadstdin perl dosomething > result.txt

Here is the meaning of the arguments:

-a: read input from the file instead of stdin
-l 1000: send blocks of 1000 lines to the command
-j 10: launch 10 jobs in parallel
-k: keep the output in the same order as the input
--spreadstdin: send each 1000-line block to the stdin of the command
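
Note that this assumes perl dosomething behaves as a plain filter, reading lines on stdin and writing results to stdout. The question doesn't show the actual script, so the following is only a hypothetical sketch of such a filter (the uc() transformation is a placeholder):

#!/usr/bin/perl
# Hypothetical stand-in for "dosomething": read lines on stdin,
# transform each one, and write the result to stdout.
use strict;
use warnings;

while ( my $line = <STDIN> ) {
    chomp $line;
    # ... whatever per-line processing you actually need ...
    print uc($line), "\n";   # placeholder transformation
}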

I've never tried it myself, but GNU parallel might be worth checking out.

Here's an excerpt from the man page ( parallel(1) ) that's similar to what you're currently doing. It can split the input in other ways too.

EXAMPLE: Processing a big file using more cores
       To process a big file or some output you can use --pipe to split up
       the data into blocks and pipe the blocks into the processing program.

       If the program is gzip -9 you can do:

       cat bigfile | parallel --pipe --recend '' -k gzip -9 >bigfile.gz

       This will split bigfile into blocks of 1 MB and pass that to gzip -9
       in parallel. One gzip will be run per CPU core. The output of gzip -9
       will be kept in order and saved to bigfile.gz

Whether this is worthwhile depends on how CPU-intensive your processing is. For simple scripts you'll spend most of the time shuffling data to and from the disk, and parallelizing won't get you much.

You can find some introductory videos by the GNU Parallel author here.

Assuming your limiting factor is NOT your disk, you can do this in Perl with fork(), specifically Parallel::ForkManager:

#!/usr/bin/perl

use strict;
use warnings;

use Parallel::ForkManager;

my $max_forks = 8; #2x procs is usually optimal

sub process_line {
    #do something with this line
}

my $fork_manager = Parallel::ForkManager -> new ( $max_forks ); 

open ( my $input, '<', 'infile.txt' ) or die $!;
while ( my $line = <$input> ) {
    $fork_manager -> start and next;   # in the parent: fork a child, then move on to the next line
    process_line ( $line );            # in the child: handle this line
    $fork_manager -> finish;           # in the child: exit when done
}

close ( $input );
$fork_manager -> wait_all_children();

The downside of doing something like this, though, is coalescing your output. Each parallel task doesn't necessarily finish in the sequence it started, so you have all sorts of potential problems regarding serialising the results.

You can work around these with something like flock, but you need to be careful, as too many locking operations can take away your parallel advantage in the first place. (Hence my first statement: if your limiting factor is disk IO, then parallelism doesn't help very much anyway.)
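
As a minimal sketch of that locking approach (write_result is just an illustrative name, and it assumes each child appends its finished text to a shared out.txt; this prevents interleaved writes but still does not preserve input order):

use Fcntl qw(:flock :seek);

# Called from inside a child to append its result safely.
sub write_result {
    my ($text) = @_;
    open( my $out, '>>', 'out.txt' ) or die $!;
    flock( $out, LOCK_EX ) or die "flock: $!";   # exclusive lock so children don't interleave
    seek( $out, 0, SEEK_END );                   # another child may have appended since we opened
    print {$out} $text;
    close( $out );                               # closing the handle releases the lock
}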

There are various possible solutions, though; so much so that there's a whole chapter on it in the Perl docs: perlipc. But keep in mind you can retrieve data with Parallel::ForkManager too.
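
A sketch of that data-retrieval route, assuming the results fit in memory: each child passes its result back through finish(), the parent collects it in a run_on_finish callback keyed by input line number, and the output is stitched back together in the original order at the end. The process_line body is again just a placeholder.

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

sub process_line {
    my ($line) = @_;
    return uc($line);   # placeholder: return the processed version of the line
}

my $pm = Parallel::ForkManager->new(8);
my %results;

# Runs in the parent each time a child exits; $data is whatever the child
# passed to finish(), and $line_no is the identifier given to start().
$pm->run_on_finish( sub {
    my ( $pid, $exit_code, $line_no, $signal, $core, $data ) = @_;
    $results{$line_no} = $$data if defined $data;
} );

open( my $input, '<', 'infile.txt' ) or die $!;
while ( my $line = <$input> ) {
    my $line_no = $.;
    $pm->start($line_no) and next;          # parent: fork and move on
    my $processed = process_line($line);    # child: do the work
    $pm->finish( 0, \$processed );          # child: exit, shipping the result back
}
close($input);
$pm->wait_all_children();

# Stitch the output back together in the original line order.
open( my $out, '>', 'out.txt' ) or die $!;
print {$out} $results{$_} for sort { $a <=> $b } keys %results;
close($out);

Forking once per line is expensive, so for a 100,000,000-line file you would hand each child a batch of lines instead, but the data flow is the same.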
