Sorting the lines of an enormous file.txt in Java

I'm working with a very big text file (755 MB). I need to sort the lines (about 1890000) and then write them to another file.

I already found a discussion whose starting file is very similar to mine: "Sorting Lines Based on words in them as keys".

The problem is that I cannot store the lines in an in-memory collection, because I get a Java heap space error (OutOfMemoryError) even after raising the heap size to the maximum (already tried!).

I also can't open it in Excel and use its sorting feature, because the file is too large to be loaded completely.

I thought about using a database, but I think that writing all the lines and then running a SELECT query would take too long. Am I wrong?

Any hints appreciated. Thanks in advance.

I think the solution here is to do a merge sort using temporary files:

  1. Read the first n lines of the input file (n being the number of lines you can afford to store and sort in memory), sort them, and write them to file 1.tmp (or whatever you want to call it). Do the same with the next n lines and store them in 2.tmp. Repeat until all lines of the original file have been processed.

  2. Read the first line of each temporary file. Determine the smallest one (according to your sort order), write it to the destination file, and read the next line from the corresponding temporary file. Repeat until all lines have been processed.

  3. Delete all the temporary files.

This works with arbitrarily large files, as long as you have enough disk space.
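
The three steps above could be sketched roughly as follows. This is only an illustration, not a tuned implementation: it assumes Java 7 or later, UTF-8 text, and natural String ordering, and the class and method names (ExternalSortSketch, splitIntoSortedChunks, writeSortedChunk, mergeChunks, sort) as well as the maxLines parameter are made up for this example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ExternalSortSketch {

    // Step 1: cut the input into sorted temporary files of at most maxLines lines each.
    static List<Path> splitIntoSortedChunks(Path input, int maxLines) throws IOException {
        List<Path> chunks = new ArrayList<Path>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<String>();
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == maxLines) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunks.add(writeSortedChunk(buffer)); // last, possibly shorter chunk
            }
        }
        return chunks;
    }

    // Sorts one in-memory chunk and writes it to a new temporary file.
    static Path writeSortedChunk(List<String> lines) throws IOException {
        Collections.sort(lines);
        Path chunk = Files.createTempFile("chunk", ".tmp");
        Files.write(chunk, lines, StandardCharsets.UTF_8);
        return chunk;
    }

    // Step 2: repeatedly pick the smallest current line among all chunks.
    static void mergeChunks(List<Path> chunks, Path output) throws IOException {
        List<BufferedReader> readers = new ArrayList<BufferedReader>();
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            List<String> heads = new ArrayList<String>();
            for (Path chunk : chunks) {
                BufferedReader r = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
                readers.add(r);
                heads.add(r.readLine()); // null if the chunk is empty
            }
            while (true) {
                int smallest = -1;
                for (int i = 0; i < heads.size(); i++) {
                    String head = heads.get(i);
                    if (head != null
                            && (smallest < 0 || head.compareTo(heads.get(smallest)) < 0)) {
                        smallest = i;
                    }
                }
                if (smallest < 0) break; // every chunk is exhausted
                out.write(heads.get(smallest));
                out.newLine();
                heads.set(smallest, readers.get(smallest).readLine());
            }
        } finally {
            for (BufferedReader r : readers) r.close();
        }
    }

    public static void sort(Path input, Path output, int maxLines) throws IOException {
        List<Path> chunks = splitIntoSortedChunks(input, maxLines);
        try {
            mergeChunks(chunks, output);
        } finally {
            // Step 3: delete the temporary files.
            for (Path chunk : chunks) Files.deleteIfExists(chunk);
        }
    }
}
```

A call might look like ExternalSortSketch.sort(Paths.get("file.txt"), Paths.get("sorted.txt"), 100000), where 100000 is just a guess at how many lines fit comfortably in the heap. The merge here scans all chunk heads for every output line, which is fine for a handful of chunks; with many chunks a priority queue would probably be the better choice.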

You can run the following with one of these sets of JVM options:

-mx1g -XX:+UseCompressedStrings  # on Java 6 update 29
-mx1800m -XX:-UseCompressedStrings  # on Java 6 update 29
-mx2g  # on Java 7 update 2.

import java.io.*;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Main {
    public static void main(String... args) throws IOException {
        long start = System.nanoTime();
        // Generates 755 MB spread over 189000 lines; the question mentions about 1890000 lines.
        generateFile("lines.txt", 755 * 1024 * 1024, 189000);

        List<String> lines = loadLines("lines.txt");

        System.out.println("Sorting file");
        Collections.sort(lines);
        System.out.println("... Sorted file");
        // save lines (not implemented in this example).
        long time = System.nanoTime() - start;
        System.out.printf("Took %.3f second to read, sort and write to a file%n", time / 1e9);
    }

    private static void generateFile(String fileName, int size, int lines) throws FileNotFoundException {
        System.out.println("Creating file to load");
        int lineSize = size / lines;
        StringBuilder sb = new StringBuilder();
        while (sb.length() < lineSize) sb.append('-');
        String padding = sb.toString();

        PrintWriter pw = new PrintWriter(fileName);
        for (int i = 0; i < lines; i++) {
            String text = (i + padding).substring(0, lineSize);
            pw.println(text);
        }
        pw.close();
        System.out.println("... Created file to load");
    }

    private static List<String> loadLines(String fileName) throws IOException {
        System.out.println("Reading file");
        BufferedReader br = new BufferedReader(new FileReader(fileName));
        List<String> ret = new ArrayList<String>();
        try {
            String line;
            while ((line = br.readLine()) != null)
                ret.add(line);
        } finally {
            br.close(); // release the file handle even if reading fails
        }
        System.out.println("... Read file.");
        return ret;
    }
}

prints

Creating file to load
... Created file to load
Reading file
... Read file.
Sorting file
... Sorted file
Took 4.886 second to read, sort and write to a file
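
The listing above leaves the save step as a comment. A minimal sketch of that step, assuming one line per list entry and the platform default charset (the same default the FileReader above uses), might look like this. The class name SaveLinesSketch and the method saveLines are made up for illustration and are not part of the original program.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

public class SaveLinesSketch {
    // Writes each line followed by the platform line separator.
    static void saveLines(String fileName, List<String> lines) throws IOException {
        BufferedWriter bw = new BufferedWriter(new FileWriter(fileName));
        try {
            for (String line : lines) {
                bw.write(line);
                bw.newLine();
            }
        } finally {
            bw.close(); // flushes the buffer and releases the file handle
        }
    }
}
```

It would be called right after Collections.sort(lines), for example as saveLines("sorted.txt", lines), where the output file name is again only an example.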

Algorithm:

How much memory do we have available? Let's assume we have X MB of memory available.

  1. Divide the file into K chunks, where X * K = 2 GB (taking a 2 GB file as the example; for the 755 MB file in the question, X * K = 755 MB). Bring a chunk into memory and sort its lines as usual using any O(n log n) algorithm. Save the sorted lines back to a file.

  2. Now bring the next chunk into memory and sort it. Repeat for every chunk.

  3. Once we're done, merge the sorted chunks.

The above algorithm is also known as external sort. Step 3 is known as an N-way merge.
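
As a rough illustration of the N-way merge in step 3, here is a sketch that keeps the current line of every chunk in a priority queue, so that picking the next smallest line costs O(log K) rather than a scan over all K chunks. It assumes the chunks are already sorted in natural String order and are handed in as BufferedReader instances; the names KWayMergeSketch, Head, and merge are invented for this example. For the chunk count, K is roughly the file size divided by X, rounded up; with the 755 MB file and an assumed X of 100 MB that would give K = 8.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMergeSketch {

    // One heap entry: the current line of a chunk plus the reader it came from.
    static final class Head implements Comparable<Head> {
        final String line;
        final BufferedReader source;

        Head(String line, BufferedReader source) {
            this.line = line;
            this.source = source;
        }

        @Override
        public int compareTo(Head other) {
            return line.compareTo(other.line);
        }
    }

    // Merges already-sorted inputs into out, writing one '\n'-terminated line at a time.
    static void merge(List<BufferedReader> sortedChunks, Writer out) throws IOException {
        PriorityQueue<Head> heap = new PriorityQueue<Head>();
        for (BufferedReader chunk : sortedChunks) {
            String first = chunk.readLine();
            if (first != null) heap.add(new Head(first, chunk)); // skip empty chunks
        }
        while (!heap.isEmpty()) {
            Head smallest = heap.poll();
            out.write(smallest.line);
            out.write('\n');
            // Refill the heap from the chunk that just supplied a line.
            String next = smallest.source.readLine();
            if (next != null) heap.add(new Head(next, smallest.source));
        }
    }
}
```

The caller is expected to open one reader per sorted chunk file and to close the readers and the writer afterwards; that part is left out here to keep the merge itself visible.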

Why don't you try multithreading and increasing the heap size of the program you are running? (This also requires a merge-sort kind of approach, provided your system has more than 755 MB of memory.)

Divide and conquer is the best solution :)

Divide your file into smaller ones, sort each file separately, then regroup (merge) them.

Links:

Sort a file with huge volume of data given memory constraint

http://hackerne.ws/item?id=1603381

Maybe you can use Perl to format the file and load it into a database such as MySQL; that is very fast. Then use an index to query the data and write the result to another file.

You can set the JVM heap size with options like '-Xms256m -Xmx1024m'. I hope this helps. Thanks.
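
To check that a heap setting actually took effect, a small program can print the maximum heap the JVM reports. This is only a sketch; the class name HeapCheck and the method maxHeapMegabytes are made up for this example, and the reported figure can differ somewhat from the value passed to -Xmx.

```java
public class HeapCheck {
    // Maximum amount of memory the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Run it with the same options as the sorting program, for example java -Xms256m -Xmx1024m HeapCheck.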
