Processing huge files in Java

I have a huge file of around 10 GB. I have to perform operations such as sort, filter, etc. on the file in Java. Each operation can be done in parallel.

Is it good to start 10 threads and read the file in parallel, with each thread reading 1 GB of the file? Is there any other option for handling extra-large files and processing them as fast as possible? Is NIO good for such scenarios?

Currently, I am performing the operations serially, and it takes around 20 minutes to process such a file.

Thanks,

Is it good to start 10 threads and read the file in parallel?

Almost certainly not, although it depends. If the file is on an SSD (where there's effectively no seek time) then maybe. If it's on a traditional disk, definitely not.

That doesn't mean you can't use multiple threads though. You could potentially create one thread to read the file, performing only the most rudimentary tasks needed to get the data into processable chunks. Then use a producer/consumer queue to let multiple threads process the data.
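As a rough illustration of that single-reader, multi-worker layout, here is a minimal sketch. The class name `ChunkedPipeline`, the method `countMatching`, the batch size, and the use of a line filter as the "processing" step are all assumptions made for the example; the original question does not say what the real work is. The calling thread acts as the one reader, and a bounded `BlockingQueue` keeps it from running far ahead of the workers.

```java
import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Hypothetical example class: one reader feeds batches of lines to several workers.
class ChunkedPipeline {
    // Sentinel batch (compared by reference) telling a worker that no more data will arrive.
    private static final List<String> POISON = new ArrayList<>();

    static long countMatching(Path file, int workers, int batchSize,
                              Predicate<String> filter) throws Exception {
        // Bounded queue: the reader blocks when the workers fall behind.
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(workers * 2);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Consumers: each worker drains batches until it sees the sentinel.
            List<Future<Long>> results = new ArrayList<>();
            for (int i = 0; i < workers; i++) {
                results.add(pool.submit(() -> {
                    long matches = 0;
                    while (true) {
                        List<String> batch = queue.take();
                        if (batch == POISON) {
                            break;
                        }
                        for (String line : batch) {
                            if (filter.test(line)) {
                                matches++;
                            }
                        }
                    }
                    return matches;
                }));
            }

            // Producer: the calling thread is the only one touching the file.
            try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                List<String> batch = new ArrayList<>(batchSize);
                String line;
                while ((line = reader.readLine()) != null) {
                    batch.add(line);
                    if (batch.size() == batchSize) {
                        queue.put(batch);
                        batch = new ArrayList<>(batchSize);
                    }
                }
                if (!batch.isEmpty()) {
                    queue.put(batch);
                }
            } finally {
                // One sentinel per worker so every worker terminates.
                for (int i = 0; i < workers; i++) {
                    queue.put(POISON);
                }
            }

            long total = 0;
            for (Future<Long> result : results) {
                total += result.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as `ChunkedPipeline.countMatching(path, 4, 10_000, line -> line.contains("ERROR"))` would count matching lines using four workers. The sketch does not handle a worker failing partway through (the reader could then block on a full queue), so real code would need more careful error handling.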

Without knowing more than "sort, filter, etc" (which is pretty vague) we can't really tell how parallelizable the process is in the first place, but trying to perform the IO in parallel on a single file will probably not help.

Try profiling the code to see where the bottlenecks are. Have you tried having one thread read the whole file (or as much of it as possible) and hand the data off to 10 threads for processing? If file I/O is your bottleneck (which seems plausible), this should improve your overall run time.
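One crude way to check whether I/O is the bottleneck, short of a real profiler, is to time a pass that only reads the file and throws the data away, then compare that with the full 20-minute run. The sketch below assumes a hypothetical helper class `ReadTimer`; the 1 MiB buffer size is an arbitrary choice.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: reads a file end to end without processing it.
class ReadTimer {
    /** Reads the whole file, discards the data, and returns the number of bytes read. */
    static long readAll(Path file) throws IOException {
        byte[] buffer = new byte[1 << 20]; // 1 MiB buffer, chosen arbitrarily
        long total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        }
        return total;
    }

    /** Returns the wall-clock seconds taken by one read-only pass over the file. */
    static double timeReadSeconds(Path file) throws IOException {
        long start = System.nanoTime();
        readAll(file);
        return (System.nanoTime() - start) / 1e9;
    }
}
```

If the read-only pass takes nearly as long as the whole job, adding processing threads is unlikely to help much. Note that a second run over the same file may be served from the operating system's cache and look faster than a cold read.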
