Trying to split a very large file into multiple smaller files based on the contents of each record (perl/linux)
Here is the problem.
I have 20 very large files, each approximately 10 GB, and I need to split each bulk file by A) criteria within the record and B) what type of bulk file it is.
Example:
Each bulk file represents an occupation. We have Lawyers, Doctors, Teachers, and Programmers. Each of these bulk files contains millions of records, but for a small set of individuals, say 40 different people in total.
A record in the doctor file may look like
XJOHN 1234567 LOREMIPSUMBLABLABLA789
I would need this record from the file to be output into a file called JOHN.DOCTOR.7
JOHN is the person's name, 7 is the last digit in the numeric sequence, and DOCTOR is the file type. I need to do this because of file size limitations.

Currently, I'm using Perl to read the bulk files line by line and print each record into the appropriate output file. I'm opening a new handle for each record to avoid having multiple threads writing to the same handle and corrupting the data. I do have the program threaded, one thread per bulk file.

I cannot install any third-party applications; assume I only have whatever comes standard with Red Hat Linux. I'm looking for either a Linux command that does this more efficiently, or perhaps a better way that Perl offers.
Thanks!
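For the "Linux command" side of the question, the per-record routing described above can be sketched with awk (standard on Red Hat). Unlike opening a new handle per record, awk keeps each output file handle open after the first `print` to it, so each output file is opened only once. The file name `doctor.bulk` and the record layout (field 1 = one prefix character plus the name, field 2 = the numeric sequence) are assumptions based on the example record:

```shell
# Minimal sketch, one bulk file at a time; TYPE names the bulk file's occupation.
printf '%s\n' 'XJOHN 1234567 LOREMIPSUMBLABLABLA789' > doctor.bulk  # demo input

awk -v TYPE=DOCTOR '{
    name  = substr($1, 2)                # drop the leading prefix character
    digit = substr($2, length($2), 1)    # last digit of the numeric sequence
    print > (name "." TYPE "." digit)    # awk keeps this handle open
}' doctor.bulk
```

With roughly 40 people and 10 possible trailing digits, that is at most a few hundred simultaneously open files per process, comfortably under the usual per-process file-descriptor limit.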
An alternate approach is to use processes instead of threads, via Parallel::ForkManager.
Additionally, I would consider using a map/reduce approach by giving each process/thread its own work directory, in which it would write intermediate files, one per doctor, lawyer, etc.
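The map phase above can be sketched in plain shell: one background process per bulk file, each confined to its own work directory so no two writers ever share an output file. The inner `echo` is a placeholder for whatever actually splits the records; the directory and file names are assumptions:

```shell
# One map process per bulk file, each in its own work directory.
for bulk in DOCTOR LAWYER; do
    mkdir -p "work.$bulk"
    (
        cd "work.$bulk" || exit 1
        # placeholder for the real per-record splitter; here it just
        # records which bulk file this worker was responsible for
        echo "$bulk" > handled.txt
    ) &
done
wait    # block until every map process has finished
```

Because each worker writes only inside its own directory, there is no cross-process contention and no risk of interleaved writes malforming a record.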
I would then write a second program, the reducer, which could be a very short shell script, to concatenate the intermediate files into their respective final output files.
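A reducer sketch under the same assumptions (work directories named `work.*`, intermediate files already carrying their final names): for every intermediate file, append its contents to the identically named file in the top-level directory. The two `mkdir`/`echo` lines only set up demo input:

```shell
# demo intermediate files, standing in for what the map processes produced
mkdir -p work.1 work.2
echo 'rec A' > work.1/JOHN.DOCTOR.7
echo 'rec B' > work.2/JOHN.DOCTOR.7

# reduce: append every intermediate file into its final output file
for f in work.*/*; do
    [ -f "$f" ] || continue
    cat "$f" >> "$(basename "$f")"
done
```

Since the reduce step runs after `wait`, it is the only writer, so the final files need no locking.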