
Parsing CSV files to arrays from very large sources in Java

I have a parser that works fine on smaller files of approximately 60,000 lines or less, but I have to parse a CSV file with over 10 million lines and this method just isn't working: it hangs for 10 seconds every 100 thousand lines. I assume it's the split method. Is there a faster way to parse data from a CSV file into a string array?

Code in question:

    String[][] events = new String[rows][columns];
    Scanner sc = new Scanner(csvFileName);

    int j = 0;
    while (sc.hasNextLine()) {
        events[j] = sc.nextLine().split(",");
        j++;
    }

Your code won't parse CSV files reliably. What if you had a ',' or a line separator in a value? It is also very slow.
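To make both points concrete, here is a minimal stdlib-only sketch. The class and method names (`QuotedCsvDemo`, `splitQuoted`, `readRows`) are invented for this illustration and are not part of any library. It assumes double quotes as the quote character, `""` as the escaped quote, and one record per line, so it still does not handle line separators inside a value; that case is one reason to reach for a real CSV library. Reading through a `BufferedReader` into a `List` also avoids `Scanner` and the need to know the row count up front, which is usually cheaper, though I have not measured it against the file in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class QuotedCsvDemo {

    // Hypothetical helper: splits one line on commas, honoring double-quoted fields.
    static String[] splitQuoted(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote is an escaped quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas inside quotes are data
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field, possibly empty
        return fields.toArray(new String[0]);
    }

    // Hypothetical helper: reads rows line by line; the list grows as needed.
    static List<String[]> readRows(Reader source) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                rows.add(splitQuoted(line));
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        String csv = "1,\"Smith, John\",42\n2,Jane,7\n";
        // Naive split breaks the quoted name into two pieces.
        System.out.println(csv.split("\n")[0].split(",").length); // prints 4
        // Quote-aware split keeps it as one field.
        System.out.println(readRows(new StringReader(csv)).get(0).length); // prints 3
    }
}
```

For a real file you would pass something like `new FileReader(csvFileName)` to `readRows` instead of the `StringReader` used here.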

Get uniVocity-parsers to parse your files. It is 3 times faster than Apache Commons CSV, has many more features, and we use it to process files with billions of rows.

To parse all rows into a list of String arrays:

    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;

    CsvParserSettings settings = new CsvParserSettings(); // lots of options here, check the documentation
    CsvParser parser = new CsvParser(settings);
    List<String[]> allRows = parser.parseAll(new FileReader(new File("path/to/input.csv")));

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

As a rule of thumb, using libraries is usually more efficient than in-house development. There are several libraries for reading and parsing CSV files. One of the more popular ones is Apache Commons CSV.

You might want to try a library I've just released: sesseltjonna-csv

It dynamically generates a CSV parser + databinding at runtime using ASM for improved performance.
