
How to parse a huge CSV file efficiently in Java

My application currently uses a CSV parser to parse CSV files and persist them to the database. It loads the entire CSV into memory and takes a long time to persist, sometimes even timing out. I have seen mixed recommendations on this site about using the univocity parser. Please advise on the best approach for processing large amounts of data in less time. Thank you.

Code:

    int numRecords = csvParser.parse( fileBytes ); //the whole file is already in memory as bytes

    public int parse(InputStream ins) throws ParserException {
        long parseTime = System.currentTimeMillis();
        fireParsingBegin();
        ParserEngine engine = null;
        try {
            engine = (ParserEngine) getEngineClass().newInstance();
        } catch (Exception e) {
            throw new ParserException(e.getMessage());
        }
        engine.setInputStream(ins);
        engine.start();
        int count = parse(engine);
        fireParsingDone();
        long seconds = (System.currentTimeMillis() - parseTime) / 1000;
        System.out.println("Time taken is " + seconds + " seconds");
        return count;
    }


    protected int parse(ParserEngine engine) throws ParserException {
        int count = 0;
        while (engine.next()) { //values String[] in the engine is populated with cell data
            if (stopParsing) {
                break;
            }

            Object o = parseObject(engine); //create an individual TO
            if (o != null) {
                count++; //count is increased after every TO is formed
                fireObjectParsed(o, engine); //put it into the BO/Col and validation preparations
            } else {
                return count;
            }
        }
        return count;
    }

Use Apache's Commons CSV library.
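
A minimal sketch with Commons CSV, streaming records lazily instead of loading the whole file into memory (the file path and column names are hypothetical placeholders):

    import java.io.Reader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVParser;
    import org.apache.commons.csv.CSVRecord;

    //the parser pulls records from the Reader one at a time
    try (Reader reader = Files.newBufferedReader(Paths.get("/path/to/your.csv"));
         CSVParser parser = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(reader)) {
        for (CSVRecord record : parser) {
            //hypothetical header names; use your actual column headers
            String column1 = record.get("column1");
            String column2 = record.get("column2");
            //persist the row here, ideally with batched inserts (see below)
        }
    }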

univocity-parsers is your best bet for loading the CSV file; you probably won't be able to hand-code anything faster. The problems you are having likely come from two things:

1 - loading everything in memory. That's generally a bad design decision, but if you do that, make sure to have enough memory allocated for your application. Give it more memory using the flags -Xms8G and -Xmx8G, for example.

2 - you are probably not batching your insert statements.

My suggestion is to try this (using univocity-parsers):

    //configure the input format
    CsvParserSettings settings = new CsvParserSettings();

    //get an iterator
    CsvParser parser = new CsvParser(settings);
    Iterator<String[]> it = parser.iterate(new File("/path/to/your.csv"), "UTF-8").iterator();

    //connect to the database and create an insert statement
    Connection connection = getYourDatabaseConnectionSomehow();
    final int COLUMN_COUNT = 2;
    PreparedStatement statement = connection.prepareStatement("INSERT INTO some_table(column1, column2) VALUES (?,?)"); 

    //run batch inserts of 1000 rows per batch
    int batchSize = 0;
    while (it.hasNext()) {
        //get next row from parser and set values in your statement
        String[] row = it.next(); 
        for(int i = 0; i < COLUMN_COUNT; i++){ 
            if(i < row.length){
                statement.setObject(i + 1, row[i]);
            } else { //row in input is shorter than COLUMN_COUNT
                statement.setObject(i + 1, null);   
            }
        }

        //add the values to the batch
        statement.addBatch();
        batchSize++;

        //once 1000 rows have been added to the batch, execute it
        if (batchSize == 1000) {
            statement.executeBatch();
            batchSize = 0;
        }
    }
    // the last batch probably won't have 1000 rows.
    if (batchSize > 0) {
        statement.executeBatch();
    }

This should execute pretty quickly and you won't even need 100 MB of memory to run it.

For the sake of clarity, I didn't use any try/catch/finally block to close any resources here. Your actual code must handle that.
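
For illustration, a minimal sketch of the same idea with try-with-resources closing the JDBC resources (getYourDatabaseConnectionSomehow() is the same hypothetical helper as above):

    try (Connection connection = getYourDatabaseConnectionSomehow();
         PreparedStatement statement = connection.prepareStatement(
                 "INSERT INTO some_table(column1, column2) VALUES (?,?)")) {
        for (String[] row : parser.iterate(new File("/path/to/your.csv"), "UTF-8")) {
            //same row-to-batch logic as in the loop above
        }
        //execute any remaining partial batch before the resources close
    }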

Hope it helps.

Streaming with Apache Commons IO

    //FileUtils and LineIterator come from org.apache.commons.io;
    //lineIterator streams the file line by line instead of reading it whole
    try (LineIterator it = FileUtils.lineIterator(theFile, "UTF-8")) {
        while (it.hasNext()) {
            String line = it.nextLine();
            // do something with line
        }
    }
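
Note that this only splits the input into lines; a CSV field can legally contain commas and even line breaks inside quotes, so a naive split like the sketch below is only safe for simple, unquoted data:

    try (LineIterator it = FileUtils.lineIterator(theFile, "UTF-8")) {
        while (it.hasNext()) {
            String line = it.nextLine();
            //naive split: breaks on quoted fields; fine only for simple CSV
            String[] fields = line.split(",", -1);
            //hand fields to your (batched) persistence logic here
        }
    }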
