簡體   English   中英

讀取帶有日期和非日期數字的非常大的 Excel 文件

[英]Read very large Excel file with date and non-date numbers

我需要讀取一個非常大的 Excel 文件,它既有日期又有日期數字。 我發現的所有示例似乎都可以做一個或另一個(將單元格標識為日期值或在常量內存中讀取文件)。

似乎適用於超大文件的唯一解決方案是此處描述的 StreamingReader 方法(此處描述的其他示例要么不適用於我擁有的文件格式,要么出現內存不足的堆錯誤)。

在java中讀取巨大的Excel文件(500K行)

http://poi.apache.org/components/spreadsheet/how-to.html#event_api

我正在做的讀取文件如下所示。 帶有 test-excel.xmls(一個小測試文件)的整個示例在 github 中可用:

https://github.com/greshje/example-poi-streaming

POM.XML:

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <build.version>1.0.4-001</build.version>
</properties>

<modelVersion>4.0.0</modelVersion>
<groupId>com.greshje.examples</groupId>
<artifactId>poi-streaming-example</artifactId>
<version>1.0.4-SNAPSHOT</version>
<packaging>jar</packaging>

<!-- 
*
* dependencies
*
-->

<dependencies>
    
    <!-- JUNIT https://mvnrepository.com/artifact/junit/junit -->
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
    
    <!-- JUNIT-TOOLBOX https://mvnrepository.com/artifact/com.googlecode.junit-toolbox/junit-toolbox -->
    <dependency>
        <groupId>com.googlecode.junit-toolbox</groupId>
        <artifactId>junit-toolbox</artifactId>
        <version>2.4</version>
        <scope>test</scope>
    </dependency>

    <!-- SLF4J LOGBACK CLASSIC https://mvnrepository.com/artifact/ch.qos.logback/logback-classic -->
    <dependency>
        <groupId>ch.qos.logback</groupId>
        <artifactId>logback-classic</artifactId>
        <version>1.2.3</version>
    </dependency>

    <!-- POI https://mvnrepository.com/artifact/org.apache.poi/poi -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>4.1.2</version>
    </dependency>

    <!-- POI-OOXML https://mvnrepository.com/artifact/org.apache.poi/poi-ooxml -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>4.1.2</version>
    </dependency>

    <!-- XERCES https://mvnrepository.com/artifact/xerces/xerces -->
    <dependency>
        <groupId>xerces</groupId>
        <artifactId>xerces</artifactId>
        <version>2.4.0</version>
    </dependency>

    <!-- XERCES-IMPL https://mvnrepository.com/artifact/xerces/xercesImpl -->
    <dependency>
        <groupId>xerces</groupId>
        <artifactId>xercesImpl</artifactId>
        <version>2.12.0</version>
    </dependency>

    <!-- XLSX-STREAMER https://mvnrepository.com/artifact/com.monitorjbl/xlsx-streamer -->
    <dependency>
        <groupId>com.monitorjbl</groupId>
        <artifactId>xlsx-streamer</artifactId>
        <version>0.2.3</version>
    </dependency>

</dependencies>

<!-- 
*
* build
*
-->

<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.7.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <version>2.3.2</version>
            <!--  
            <configuration> 
                <finalName></finalName>                   
            </configuration>
            -->
        </plugin>      
    </plugins>
</build>

Java代碼:

package com.greshje.example.poi.streaming;

import java.io.InputStream;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.monitorjbl.xlsx.StreamingReader;

public class PoiStreamingExample {

    private static final Logger log = LoggerFactory.getLogger(PoiStreamingExample.class);

    private static final String FILE_NAME = "/com/greshje/example/poi/streaming/test-file.xlsx";

    public static void main(String[] args) {
        log.info("Starting test...");
        log.info("Getting file");
        InputStream in = PoiStreamingExample.class.getResourceAsStream(FILE_NAME);
        log.info("Got file");
        StreamingReader reader = getReader(in, 0);
        log.info("File contents:");
        for (Row row : reader) {
            String rowString = "";
            for (Cell cell : row) {
                if (rowString != "") {
                    rowString += ",";
                }
                // NEED A WAY TO GET A DATE WHERE APPROPRIATE HERE
                rowString += cell.getStringCellValue();
            }
            log.info(rowString);
        }
        log.info("Done.");
    }

    public static StreamingReader getReader(InputStream in, int sheetIndex) {
        try {
            StreamingReader reader = StreamingReader.builder()
                    .rowCacheSize(100) // number of rows to keep in memory (defaults to 10)
                    .bufferSize(4096) // buffer size to use when reading InputStream to file (defaults to 1024)
                    .sheetIndex(sheetIndex) // index of sheet to use
                    .read(in); // read the file
            return reader;
        } catch (Exception exp) {
            throw new RuntimeException(exp);
        }
    }

}

我的測試數據如下所示:

在此處輸入圖片說明

輸出看起來像這樣(日期而不是日期值表示為數字)。

2020-09-06 10:47:13,814 10:47:13.814 [main] INFO  (PoiStreamingExample.java:19) - Starting test...
2020-09-06 10:47:13,822 10:47:13.822 [main] INFO  (PoiStreamingExample.java:20) - Getting file
2020-09-06 10:47:13,823 10:47:13.823 [main] INFO  (PoiStreamingExample.java:22) - Got file
2020-09-06 10:47:15,117 10:47:15.117 [main] INFO  (PoiStreamingExample.java:24) - File contents:
2020-09-06 10:47:15,149 10:47:15.149 [main] INFO  (PoiStreamingExample.java:33) - Number,Date (mostly),Date (mostly)
2020-09-06 10:47:15,150 10:47:15.150 [main] INFO  (PoiStreamingExample.java:33) - 123456,43550
2020-09-06 10:47:15,150 10:47:15.150 [main] INFO  (PoiStreamingExample.java:33) - 123456,43685,44019
2020-09-06 10:47:15,150 10:47:15.150 [main] INFO  (PoiStreamingExample.java:33) - 123456,43522,43535
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 123456,43503,43538
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 123456,43535,43564
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 123456,43536,43574
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 7890123,43553,43700
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 7890123,44041
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 7890123,43521,43550
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 7890123,43558,43580
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 7890123,43567,43599
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 7890123,43633,43641
2020-09-06 10:47:15,151 10:47:15.151 [main] INFO  (PoiStreamingExample.java:33) - 7890123,43573,43615
2020-09-06 10:47:15,152 10:47:15.152 [main] INFO  (PoiStreamingExample.java:33) - 7890123,43577,43606
2020-09-06 10:47:15,152 10:47:15.152 [main] INFO  (PoiStreamingExample.java:33) - 7890123,43719,43754
2020-09-06 10:47:15,152 10:47:15.152 [main] INFO  (PoiStreamingExample.java:33) - 7890123,43634,43641
2020-09-06 10:47:15,152 10:47:15.152 [main] INFO  (PoiStreamingExample.java:33) - 123,43550
2020-09-06 10:47:15,152 10:47:15.152 [main] INFO  (PoiStreamingExample.java:33) - smith,43550
2020-09-06 10:47:15,152 10:47:15.152 [main] INFO  (PoiStreamingExample.java:33) - jones,43550
2020-09-06 10:47:15,152 10:47:15.152 [main] INFO  (PoiStreamingExample.java:33) - 43550,43550
2020-09-06 10:47:15,152 10:47:15.152 [main] INFO  (PoiStreamingExample.java:35) - Done.

- - 編輯 - - - - - - - - - - - -

我更新了 xls-stream 版本,但仍然沒有問題更新到新版本已修復!!!

<!-- XLSX-STREAMER https://mvnrepository.com/artifact/com.monitorjbl/xlsx-streamer -->
<dependency>
    <groupId>com.monitorjbl</groupId>
    <artifactId>xlsx-streamer</artifactId>
    <version>2.1.0</version>
</dependency>

這是舊版本:似乎沒有任何信息可以確定細胞類型:

在此處輸入圖片說明

此外,似乎不支持獲取單元格類型的代碼(在舊版本中)

在此處輸入圖片說明

新版本有更多的單元格信息,並處理日期和數字,給出已接受答案中顯示的結果。

使用最新版本的Excel Streaming Reader 2.1.0 ,這個問題就消失了。

使用您的test-file.xlsx和以下代碼:

import java.io.InputStream;

import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Cell;

import com.monitorjbl.xlsx.StreamingReader;

public class PoiStreamingExample {

 private static final String FILE_NAME = "./test-file.xlsx";

 public static void main(String[] args) {

  try (
   InputStream is = PoiStreamingExample.class.getResourceAsStream(FILE_NAME);
   Workbook workbook = StreamingReader.builder()
    .rowCacheSize(100)
    .bufferSize(4096)
    .open(is)) {
   Sheet sheet =  workbook.getSheetAt(0);
   for (Row r : sheet) {
    String rowString = "";
    for (Cell c : r) {
     if (rowString != "") {
      rowString += ",";
     } 
     rowString += c.getStringCellValue();
    }
    System.out.println(rowString);
   }
  } catch (Exception ex) {
   ex.printStackTrace();
  }

 }
}

它打印:

Number,Date (mostly),Date (mostly)
123456,3/26/19
123456,8/8/19,7/7/20
123456,2/26/19,3/11/19
123456,2/7/19,3/14/19
123456,3/11/19,4/9/19
123456,3/12/19,4/19/19
7890123,3/29/19,8/23/19
7890123,7/29/20
7890123,2/25/19,3/26/19
7890123,4/3/19,4/25/19
7890123,4/12/19,5/14/19
7890123,6/17/19,6/25/19
7890123,4/18/19,5/30/19
7890123,4/22/19,5/21/19
7890123,9/11/19,10/16/19
7890123,6/18/19,6/25/19
123,43550
smith,43550
jones,43550
43550,43550

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM