How to work with a large CSV file

I have a very large CSV file and I need to run some select-style queries on it, such as computing an average. I cannot do that the usual way, reading it line by line, because I run out of memory.

The following code works well on a small CSV file but not on a huge one. I would appreciate it if you could edit this code so that it works for a large CSV file.

import java.io.File;

import java.io.FileNotFoundException;
import java.util.Scanner;


public class Mu {
    public void Computemu()
    {
        String filename="testdata.csv";
        File file=new File(filename);
        try {
            Scanner inputstream=new Scanner(file);//Scanner read only string 
            // String data=inputstream.next();//Ignore the first line(header)
            double sum=0;
            double numberOfRating=0;

            while (inputstream.hasNext())
            {                       
                String data=inputstream.next();//meant to get a whole line (next() returns one token)
                String[] values= data.split(";");//values separated by ;
                double rating=Double.parseDouble(values[2].replaceAll("\"", ""));//strip the quotes and parse the rating as a double
                if(rating>0)//do not consider implicit ratings
                {
                    sum+=rating;
                    numberOfRating++;
                }
            }
            inputstream.close();
            System.out.println("Mu is"+ (sum/numberOfRating));
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }
}

You didn't call useDelimiter, so next() splits on whitespace (the default delimiter). If the file contains no whitespace, next() has to load the whole file into a single string.

This leads to an OutOfMemoryError.

If you want to use a Scanner, set the delimiter according to your needs.
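As a minimal sketch of that suggestion, the version below sets a line-break delimiter so that each call to next() returns one line, which keeps memory use bounded by the longest line rather than by the file size. The class name ScannerMean, the method meanRating and the hasHeader flag are made up for illustration; the layout (fields separated by ";", quoted values, rating in the third column) is taken from the code in the question, and the presence of a header line is an assumption.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.regex.Pattern;

class ScannerMean {
    // Matches one line terminator: \r\n, \n or \r.
    private static final Pattern LINE_BREAK = Pattern.compile("\\r\\n|[\\n\\r]");

    // Returns the mean of the positive ratings found in the third column.
    // hasHeader is an assumption about the file: when true, the first line is skipped.
    static double meanRating(File file, boolean hasHeader) throws FileNotFoundException {
        double sum = 0;
        long count = 0;
        try (Scanner in = new Scanner(file)) {
            in.useDelimiter(LINE_BREAK); // one token per line, so fields may contain spaces
            if (hasHeader && in.hasNext()) {
                in.next(); // discard the header line
            }
            while (in.hasNext()) {
                String line = in.next();
                if (line.isEmpty()) {
                    continue; // blank lines show up as empty tokens
                }
                String[] values = line.split(";");
                double rating = Double.parseDouble(values[2].replace("\"", ""));
                if (rating > 0) { // do not consider implicit ratings
                    sum += rating;
                    count++;
                }
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) throws FileNotFoundException {
        if (args.length != 1) {
            System.err.println("usage: java ScannerMean <file.csv>");
            return;
        }
        System.out.println("Mu is " + meanRating(new File(args[0]), true));
    }
}
```

Only the running sum and count are kept between lines, so nothing grows with the size of the file.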

But a CSV library (like csvfile) would probably be more efficient.

I suggest using FileUtils from Apache Commons IO for this use case. This may not be exactly what your question asks for, but using FileUtils is preferable to re-implementing it.

Specifically, take a look at the lineIterator method.
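Commons IO is a third-party dependency, so the sketch below does not call lineIterator itself. As a stand-in it uses the JDK's BufferedReader to show the same line-at-a-time pattern: only one line is held in memory at any moment, and the per-line processing would be the same with a line iterator. The names StreamingMean and meanRating, the hasHeader flag and the UTF-8 encoding are assumptions made for illustration; the column layout comes from the code in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class StreamingMean {
    // Streams the file one line at a time and returns the mean of the
    // positive ratings in the third column.
    static double meanRating(Path csv, boolean hasHeader) throws IOException {
        double sum = 0;
        long count = 0;
        // UTF-8 is an assumption; pass the charset the file actually uses.
        try (BufferedReader reader = Files.newBufferedReader(csv, StandardCharsets.UTF_8)) {
            if (hasHeader) {
                reader.readLine(); // discard the header line
            }
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] values = line.split(";");
                double rating = Double.parseDouble(values[2].replace("\"", ""));
                if (rating > 0) { // do not consider implicit ratings
                    sum += rating;
                    count++;
                }
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StreamingMean <file.csv>");
            return;
        }
        System.out.println("Mu is " + meanRating(Paths.get(args[0]), true));
    }
}
```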
