How can I read xls and xlsx files in Spark with Java?

I want to read xls and xlsx (MS Excel) files row by row in Spark, the way we do for text files, or in any other way.

I want to use Spark to speed up reading a large xls file, say 1 GB, which is why I need Spark to read the file in parts, as it does for text files.

How can I read data from Excel files in Spark, whether line by line or not?

I just want to read the entries in the xls file using Spark, by whatever means.

Please suggest.

Thanks!

You cannot do that with Spark on its own; it is not meant for it. Use some other library, e.g. Apache POI, to read the Excel file and then feed that data to Spark as text.
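The "feed that data to Spark as text" step means each spreadsheet row has to be written out as one delimited line. Below is a minimal sketch of that serialization only, assuming the rows have already been extracted from the workbook as lists of strings (for example with Apache POI's `WorkbookFactory` and `DataFormatter`). The class and method names `RowsToCsv`, `escape` and `toCsvLine` are hypothetical and not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowsToCsv {

    // Quote a cell when it contains a comma, a double quote, or a line break
    // (RFC 4180 style); embedded double quotes are doubled.
    static String escape(String cell) {
        if (cell == null) {
            return "";
        }
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        String escaped = cell.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Turn one row of cell values into a single CSV line.
    static String toCsvLine(List<String> row) {
        List<String> cells = new ArrayList<>();
        for (String cell : row) {
            cells.add(escape(cell));
        }
        return String.join(",", cells);
    }

    public static void main(String[] args) {
        // Hypothetical rows; in practice they would come from the Excel reader.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("id", "name", "note"),
                Arrays.asList("1", "Smith, John", "said \"hi\""));
        for (List<String> row : rows) {
            System.out.println(toCsvLine(row));
        }
    }
}
```

Once the lines are written to a file, Spark can load it as ordinary text, for instance with `spark.read().option("header", "true").csv(path)`. Unlike the original workbook, that text file can be split across tasks.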

Though the question is a bit old, I am still answering it; it may be useful to someone else. The answer is yes, you can do it with Apache Spark 2.x. Let's say you want to convert an xls file with 3 columns to a Dataset.

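// Encoders.bean(...) expects a public JavaBean: a no-arg constructor plus a
// getter and setter for each field (omitted here for brevity).
public class Bean {
    private String col1;
    private String col2;
    private java.sql.Timestamp col3;
}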

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Explicit schema matching the three columns of the sheet
StructType structType = new StructType(new StructField[] {
        new StructField("col1", DataTypes.StringType, true, Metadata.empty()),
        new StructField("col2", DataTypes.StringType, true, Metadata.empty()),
        new StructField("col3", DataTypes.TimestampType, true, Metadata.empty())
});

// sparkSession is an existing SparkSession
Dataset<Bean> ds = sparkSession.read()
        .schema(structType)
        .format("com.crealytics.spark.excel")
        .option("useHeader", true) // if the xls file has headers
        .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") // timestamp pattern, if the default does not fit
        .option("treatEmptyValuesAsNulls", "false")
        .option("inferSchema", "false") // the schema is supplied explicitly above
        .option("addColorColumns", "false")
        .load("/home/user/test/sample.xls") // path to the xls or xlsx file
        .as(Encoders.bean(Bean.class)); // bean to map the rows to; drop this line if Dataset<Row> is enough

You can try the HadoopOffice library to read/write Excel files with Spark ( https://github.com/ZuInnoTe/hadoopoffice/wiki ). It supports encrypted Excel, linked workbooks, filtering by metadata ...

Here is how I did it.

In Maven, add the dependencies:

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.4.2</version>
    </dependency>
    <dependency>
        <groupId>com.crealytics</groupId>
        <artifactId>spark-excel_2.11</artifactId>
        <version>0.11.1</version>
    </dependency>
</dependencies>

My main class:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadExcelSheets {

    public static void main(String[] args) {
        // reduce log noise
        Logger.getLogger("org").setLevel(Level.ERROR);

        // build the session
        SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL Example")
                .config("spark.master", "local")
                .getOrCreate();

        // read the Excel file; change the file name as needed
        Dataset<Row> df = spark.read()
                .format("com.crealytics.spark.excel")
                .option("useHeader", "true")
                //.option("dataAddress", "'Sheet1'!A1:M1470") // optional: read only a range of a sheet, here from A1 (top-left cell) to M1470 (bottom-right cell)
                .load("datasets/test1.xlsx");
        // show the data
        df.show();
    }
}
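The `dataAddress` value in the commented-out line above is a sheet name in single quotes, then `!`, then a cell range. When the range bounds are only known as numbers, the string can be built programmatically. Below is a small sketch, assuming 1-based column and row numbers and a sheet name that contains no single quote. `DataAddress`, `columnLetters` and `dataAddress` are hypothetical helper names, not part of spark-excel.

```java
public class DataAddress {

    // Convert a 1-based column number to Excel column letters: 1 -> A, 26 -> Z, 27 -> AA.
    static String columnLetters(int column) {
        if (column < 1) {
            throw new IllegalArgumentException("column must be >= 1");
        }
        StringBuilder letters = new StringBuilder();
        while (column > 0) {
            int remainder = (column - 1) % 26;
            letters.insert(0, (char) ('A' + remainder));
            column = (column - 1) / 26;
        }
        return letters.toString();
    }

    // Build an address of the form 'Sheet'!<firstCell>:<lastCell>.
    static String dataAddress(String sheet, int firstCol, int firstRow, int lastCol, int lastRow) {
        return "'" + sheet + "'!"
                + columnLetters(firstCol) + firstRow
                + ":" + columnLetters(lastCol) + lastRow;
    }

    public static void main(String[] args) {
        // Columns 1..13 and rows 1..1470 give the range used in the example above.
        System.out.println(dataAddress("Sheet1", 1, 1, 13, 1470));
    }
}
```

The resulting string would then be passed as `.option("dataAddress", ...)` on the reader.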

