
Trying to split a very large file into multiple smaller files based on the contents of each record (perl/linux)

Here is the problem.

I have 20 very large files, each approx 10 GB, and I need to split each of the bulk files by A) criteria within the record and B) what type of bulk file it is.

Example.

Each bulk file represents an occupation. We have Lawyers, Doctors, Teachers and Programmers. Each of these bulk files contains millions of records for different individuals, but not a lot of distinct individuals, say 40 different people in total.

A record in the doctor file may look like:

XJOHN 1234567   LOREMIPSUMBLABLABLA789

I would need this record from the file to be output into a file called JOHN.DOCTOR.7

John is the person's name, 7 is the last digit in the numeric sequence, and DOCTOR is the file type. I need to do this because of file size limitations.

Currently, I'm using Perl to read the bulk files line by line and print each record into the appropriate output file. I'm opening a new filehandle for each record to avoid having multiple threads write to the same handle and corrupt the data. The program is threaded, one thread per bulk file. I cannot install any third-party applications; assume I only have whatever comes standard with Red Hat Linux. I'm looking for either a Linux command that can do this more efficiently, or a better way that Perl offers.
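For illustration, here is a minimal single-process sketch of the per-record split described above. The record layout (one-character prefix, name, whitespace, numeric sequence), the doctor.bulk file name, and the regex are assumptions for the example, not the asker's exact code; it also caches one filehandle per output file instead of reopening a handle for every record.

    #!/usr/bin/perl
    # Sketch: split one bulk file into NAME.OCCUPATION.DIGIT files.
    # Record layout and file names are assumptions for illustration.
    use strict;
    use warnings;

    my $occupation = 'DOCTOR';      # taken from which bulk file we are reading
    my %fh;                         # cache of open output filehandles

    open my $bulk, '<', 'doctor.bulk' or die "Cannot open bulk file: $!";
    while (my $line = <$bulk>) {
        # e.g. "XJOHN 1234567   LOREMIPSUMBLABLABLA789"
        my ($name, $number) = $line =~ /^.(\w+)\s+(\d+)/ or next;
        my $digit = substr $number, -1;
        my $out   = "$name.$occupation.$digit";

        # open each output file once and reuse the handle afterwards
        unless ($fh{$out}) {
            open $fh{$out}, '>>', $out or die "Cannot open $out: $!";
        }
        print { $fh{$out} } $line;
    }
    close $bulk;
    close $_ for values %fh;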

Thanks!

An alternate approach is to use processes instead of threads, via Parallel::ForkManager.

Additionally, I would consider using a map/reduce approach by giving each process/thread its own work directory, in which it would write intermediate files, one per doctor, lawyer, etc.
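A minimal sketch of that "map" stage, assuming Parallel::ForkManager is available; the *.bulk glob, the work.$$ directory scheme, and the split_one_file() routine are placeholders for this example, not part of the original answer.

    #!/usr/bin/perl
    # Sketch: fork one worker per bulk file, each writing intermediate
    # files into its own private work directory.
    use strict;
    use warnings;
    use File::Path qw(make_path);
    use Parallel::ForkManager;

    my @bulk_files = glob '*.bulk';                 # hypothetical naming
    my $pm = Parallel::ForkManager->new(scalar @bulk_files);

    for my $file (@bulk_files) {
        $pm->start and next;                        # parent moves on, child continues

        my $workdir = "work.$$";                    # private directory per child process
        make_path($workdir);
        split_one_file($file, $workdir);            # per-record split as sketched above

        $pm->finish;
    }
    $pm->wait_all_children;

    sub split_one_file {
        my ($file, $workdir) = @_;
        # ... split $file into per-person files inside $workdir ...
    }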

I would then write a second program, the reducer, which could be a very short shell script, to concatenate the intermediate files into their respective final output files.
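The answer suggests a short shell script for this step; the following is an equivalent minimal sketch in Perl, assuming the work.* directory layout from the previous sketch.

    #!/usr/bin/perl
    # Sketch: "reduce" step that appends every worker's intermediate files
    # onto the final per-person output files.
    use strict;
    use warnings;
    use File::Basename qw(basename);

    for my $part (sort glob 'work.*/*') {
        my $final = basename($part);                # e.g. JOHN.DOCTOR.7
        open my $in,  '<',  $part  or die "Cannot read $part: $!";
        open my $out, '>>', $final or die "Cannot append to $final: $!";
        print {$out} $_ while <$in>;
        close $in;
        close $out;
    }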
