
Java - Splitting Large SQL Text File on Delimiter Using Scanner (OutOfMemoryError)

I am trying to write an application that will take a very large SQL text file, ~60GB (257 million lines), and split each of the COPY statements into separate text files.

However, the code I am currently using causes an OutOfMemoryError because individual statements exceed the Scanner buffer limit. The first statement is going to be ~40 million lines long.

public static void readFileByDelimeter(String fileName, String requestType, String output) throws FileNotFoundException {

    // create file instance
    File file = new File(fileName);

    // create scanner instance
    Scanner scanner = new Scanner(file, "latin1");

    // set custom delimiter
    scanner.useDelimiter("COPY");

    int number = 0;
    System.out.println("Running......");
    while (scanner.hasNext()) {
        String line = scanner.next();
        if (line.length() > 20) {
            // save statements to separate SQL files
            PrintWriter out = new PrintWriter("statement" + number + ".sql");
            out.println("COPY" + line.trim());
            out.close();
        }
        number++;
    }
    scanner.close();

    System.out.println("Completed");
}

Please provide a recommendation as to whether this is the wrong approach for the task, or suggest alterations to the existing method.

Thanks

Me personally: I use BufferedReader instead of Scanner. It also has a convenient readLine() method, and I've never had any performance issues with it. The only thing is that you'd need to manually check whether a line read is one that you want to process, but that's usually as simple as applying the String class methods.

That's not an answer to your actual question, but I consider it a decent, easy-to-use alternative.
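As a rough sketch of that line-by-line approach (assuming each COPY statement starts at the beginning of a line, which matches the question's delimiter; the method name here is illustrative, not the original poster's code):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

public static void splitByCopy(String fileName) throws IOException {
    int number = 0;
    PrintWriter out = null;
    try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
        String line;
        while ((line = br.readLine()) != null) {
            // a new COPY statement begins: switch to a fresh output file
            if (line.startsWith("COPY")) {
                if (out != null) {
                    out.close();
                }
                out = new PrintWriter("statement" + number++ + ".sql");
            }
            // only one line is held in memory at a time
            if (out != null) {
                out.println(line);
            }
        }
    } finally {
        if (out != null) {
            out.close();
        }
    }
}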

Try something like this (but prettier):

Scanner sc = new Scanner(new BufferedReader(new FileReader(file)));

This decorates the whole thing with a BufferedReader, meaning that not all of the file's content will be loaded into memory at once. You can use the Scanner in the same way.
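For example, a minimal sketch of this buffered-Scanner variant, assuming the same "COPY" delimiter as the question (the method name is illustrative; note that Scanner must still hold each delimited token in memory, so a 40-million-line statement can still exhaust the heap):

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Scanner;

public static void scanBuffered(String fileName) throws FileNotFoundException {
    try (Scanner sc = new Scanner(new BufferedReader(new FileReader(fileName)))) {
        sc.useDelimiter("COPY");
        while (sc.hasNext()) {
            // each token returned by next() must still fit in memory
            String statement = sc.next();
            // handle the statement...
        }
    }
}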

Try to use a BufferedReader. Direct use of Scanner on a file or raw file stream would load the data into memory and won't flush it out on GC. The best approach is to use a BufferedReader, read one line at a time, and do manual string checks and splitting. If done correctly this way, you can give the GC enough opportunity to reclaim memory when needed.

First, why are you (or some other process) creating a 60GB file? Maybe you need to look at that process and fix it to generate smaller SQL text files instead. However, if this is a one-time thing you need to do, that might be fine; to address your question, I would use a BufferedReader to read and process the records, since it's a large file as you indicated.

BufferedReader br = new BufferedReader(new FileReader(file));
String line;
while ((line = br.readLine()) != null) {
    // process the line and write it into your output file
}
br.close();
