How to efficiently parse huge CSV files in Java

Question

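My application currently uses a CSV parser to parse CSV files and persist them to a database. It loads the entire CSV into memory and takes a long time to persist, sometimes even timing out. I have seen mixed recommendations on this site about using the univocity parser. Please suggest the best way to process a large amount of data in less time.
Thank you.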

Code:

int numRecords = csvParser.parse(fileBytes);

public int parse(InputStream ins) throws ParserException {
    long parseTime = System.currentTimeMillis();
    fireParsingBegin();
    ParserEngine engine = null;
    try {
        engine = (ParserEngine) getEngineClass().newInstance();
    } catch (Exception e) {
        throw new ParserException(e.getMessage());
    }
    engine.setInputStream(ins);
    engine.start();
    int count = parse(engine);
    fireParsingDone();
    long seconds = (System.currentTimeMillis() - parseTime) / 1000;
    System.out.println("Time taken is " + seconds);
    return count;
}


protected int parse(ParserEngine engine) throws ParserException {
    int count = 0;
    while (engine.next()) { // the engine's values String[] is now populated with cell data
        if (stopParsing) {
            break;
        }

        Object o = parseObject(engine); // create an individual TO
        if (o != null) {
            count++; // count is increased after every TO is created
            fireObjectParsed(o, engine); // hand the TO on to the BO/collection and validation preparation
        } else {
            return count;
        }
    }
    return count;
}

Answer 1

Use Apache's Commons CSV library.

Answer 2

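univocity-parsers is the best choice for loading CSV files; you probably won't be able to hand-code anything faster. The problems you are having likely come from two things:

1 - Loading everything into memory. That is generally a bad design decision, but if you do it, make sure your application has enough memory allocated. For example, use the flags -Xms8G and -Xmx8G to give it more memory.

2 - You are probably not batching your insert statements.

My suggestion is to try this (using univocity-parsers):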

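    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;
    import java.io.File;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.Iterator;

    //configure the input format via the parser settings
    CsvParserSettings settings = new CsvParserSettings();

    //get an iterator
    CsvParser parser = new CsvParser(settings);
    Iterator<String[]> it = parser.iterate(new File("/path/to/your.csv"), "UTF-8").iterator();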

    //connect to the database and create an insert statement
    Connection connection = getYourDatabaseConnectionSomehow();
    final int COLUMN_COUNT = 2;
    PreparedStatement statement = connection.prepareStatement("INSERT INTO some_table(column1, column2) VALUES (?,?)"); 

    //run batch inserts of 1000 rows per batch
    int batchSize = 0;
    while (it.hasNext()) {
        //get next row from parser and set values in your statement
        String[] row = it.next(); 
        for(int i = 0; i < COLUMN_COUNT; i++){ 
            if(i < row.length){
                statement.setObject(i + 1, row[i]);
            } else { //row in input is shorter than COLUMN_COUNT
                statement.setObject(i + 1, null);   
            }
        }

        //add the values to the batch
        statement.addBatch();
        batchSize++;

        //once 1000 rows made into the batch, execute it
        if (batchSize == 1000) {
            statement.executeBatch();
            batchSize = 0;
        }
    }
    // the last batch probably won't have 1000 rows.
    if (batchSize > 0) {
        statement.executeBatch();
    }

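This should execute quickly, and you won't even need 100 MB of memory to run it.

For the sake of clarity, I didn't use any try/catch/finally blocks to close the resources here. Your real code must handle that.

Hope it helps.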
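
The resource handling that the answer leaves out could be sketched roughly as follows, using only `java.sql` from the JDK. This is a minimal sketch, not the answer author's code: the class name `BatchInsertSketch` and the method `insertInBatches` are made up for illustration, and the SQL reuses the answer's placeholder table and column names. The row iterator can come from any source, for example the univocity iterator shown above.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Iterator;

// Hypothetical helper class, named here for illustration only.
final class BatchInsertSketch {

    // Placeholder table and column names, copied from the answer's example SQL.
    private static final String INSERT_SQL =
            "INSERT INTO some_table(column1, column2) VALUES (?,?)";
    private static final int COLUMN_COUNT = 2;

    /**
     * Sends every row from the iterator to the database in batches of
     * batchSize and returns the number of rows added.
     * The caller owns the connection and is responsible for closing it.
     */
    static int insertInBatches(Connection connection, Iterator<String[]> rows, int batchSize)
            throws SQLException {
        int total = 0;
        int pending = 0;
        // try-with-resources closes the statement even if a batch fails
        try (PreparedStatement statement = connection.prepareStatement(INSERT_SQL)) {
            while (rows.hasNext()) {
                String[] row = rows.next();
                for (int i = 0; i < COLUMN_COUNT; i++) {
                    // rows shorter than COLUMN_COUNT are padded with NULL
                    statement.setObject(i + 1, i < row.length ? row[i] : null);
                }
                statement.addBatch();
                pending++;
                total++;
                if (pending == batchSize) {
                    statement.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                statement.executeBatch(); // flush the final, partial batch
            }
        }
        return total;
    }
}
```

A caller would typically open the connection in its own try-with-resources block and pass it in, so that both the connection and the statement are released when an exception is thrown halfway through the file.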

Answer 3

Streaming with Apache Commons IO

try (LineIterator it = FileUtils.lineIterator(theFile, "UTF-8")) {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with line
    }
}
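
If adding Commons IO is not an option, the same line-by-line streaming can probably be done with the JDK alone. The sketch below assumes a hypothetical helper class `LineStreamSketch` with a method `forEachLine`; neither name comes from the answer. Like the snippet above, it only splits the file into lines and does not parse CSV, so quoted fields that contain line breaks would still need a real CSV parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

// Hypothetical helper class, named here for illustration only.
final class LineStreamSketch {

    /**
     * Reads the file one line at a time, passes each line to the handler,
     * and returns the number of lines read. Only the current line (plus the
     * reader's buffer) is held in memory.
     */
    static long forEachLine(Path file, Consumer<String> handler) throws IOException {
        long count = 0;
        // try-with-resources closes the reader when the loop ends or fails
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                handler.accept(line); // do something with the line
                count++;
            }
        }
        return count;
    }
}
```

A call such as `LineStreamSketch.forEachLine(Paths.get("/path/to/your.csv"), line -> { /* handle line */ })` would then process the file without loading it into memory as a whole.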

3 answers

Answer 1
0 2018-10-29 16:14:38

Answer 2
0 Accepted 2018-10-30 01:53:13

Answer 3
0 2022-03-01 14:51:57
