How can I read xls and xlsx files in Spark with Java?
I want to read xls and xlsx (MS Excel) files row by row in Spark, the way we do it for text files, or in any other way.
I want to use Spark to improve performance when reading a large xls file, say 1 GB; that's why I need Spark to read the file in parts, as we do for text files.
How can I read data from Excel files in Spark, whether line by line or not?
I just want to read the entries in the xls file using Spark, one way or another.
Please suggest.
Thanks!!!
You cannot do that with Spark directly. It is not meant for it. Use some other library, e.g. Apache POI, to read the Excel file, and then feed that data to Spark as text.
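A minimal sketch of that approach, assuming the `poi-ooxml` dependency is on the classpath; the file path, class name, and the comma-joined row format are placeholders for illustration:

```java
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class PoiToSpark {
    public static void main(String[] args) throws Exception {
        // Read the whole workbook with POI. Note this loads the file on the
        // driver, so it only works for workbooks that fit in driver memory.
        List<String> lines = new ArrayList<>();
        DataFormatter formatter = new DataFormatter();
        try (Workbook workbook = WorkbookFactory.create(new File("/home/user/test/sample.xlsx"))) {
            Sheet sheet = workbook.getSheetAt(0);
            for (Row row : sheet) {
                StringBuilder sb = new StringBuilder();
                for (Cell cell : row) {
                    if (sb.length() > 0) sb.append(",");
                    // formatCellValue renders each cell as the text POI would display
                    sb.append(formatter.formatCellValue(cell));
                }
                lines.add(sb.toString());
            }
        }

        // Hand the extracted rows to Spark as a Dataset of CSV-like strings.
        SparkSession spark = SparkSession.builder()
                .appName("PoiToSpark")
                .master("local[*]")
                .getOrCreate();
        Dataset<String> ds = spark.createDataset(lines, Encoders.STRING());
        ds.show();
        spark.stop();
    }
}
```

Note this forfeits parallel reading of the file itself; Spark only parallelizes whatever processing you do after `createDataset`.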
Though the question is a bit old, I am still answering it. Maybe it will be useful to someone else. The answer is yes, you can do it with Apache Spark 2.x. Let's say you want to convert an xls with 3 columns to a Dataset.
class Bean {
    private String col1;
    private String col2;
    private Timestamp col3;
    // add public getters and setters for each field;
    // Encoders.bean() relies on the JavaBean convention
}
StructType structType = new StructType(new StructField[] {
    new StructField("col1", DataTypes.StringType, true, Metadata.empty()),
    new StructField("col2", DataTypes.StringType, true, Metadata.empty()),
    new StructField("col3", DataTypes.TimestampType, true, Metadata.empty())
});
Dataset<Bean> ds = sparkSession.read()
    .schema(structType)
    .format("com.crealytics.spark.excel")
    .option("useHeader", true)                        // the xls file has a header row
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") // parse timestamps in this format
    .option("treatEmptyValuesAsNulls", "false")
    .option("inferSchema", "false")
    .option("addColorColumns", "false")
    .load("/home/user/test/sample.xls")               // path to the xls or xlsx file
    .as(Encoders.bean(Bean.class));                   // the bean to convert the data into; drop this line if Dataset<Row> is fine for you
You can try the HadoopOffice library to read/write Excel files with Spark ( https://github.com/ZuInnoTe/hadoopoffice/wiki ). It supports encrypted Excel, linked workbooks, filtering by metadata, and more.
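A minimal read through its Spark 2 data source might look like the sketch below. The format name and the `read.spark.simpleMode` option are taken from the project's wiki, so treat them as assumptions and verify them against the `spark-hadoopoffice-ds` version you actually use:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HadoopOfficeRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("HadoopOfficeRead")
                .master("local[*]")
                .getOrCreate();
        // Assumes the spark-hadoopoffice-ds package is on the classpath.
        Dataset<Row> excel = spark.read()
                .format("org.zuinnote.spark.office.excel")
                .option("read.spark.simpleMode", "true") // plain values instead of cell-metadata objects
                .load("/home/user/test/sample.xlsx");
        excel.show();
        spark.stop();
    }
}
```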
Here is how I did it.
In Maven, add the dependencies:
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.4.2</version>
    </dependency>
    <dependency>
        <groupId>com.crealytics</groupId>
        <artifactId>spark-excel_2.11</artifactId>
        <version>0.11.1</version>
    </dependency>
</dependencies>
My main class:
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadExcelSheets {

    public static void main(String[] args) {
        // skip logging extras
        Logger.getLogger("org").setLevel(Level.ERROR);

        // build session
        SparkSession spark = SparkSession
            .builder()
            .appName("Java Spark SQL Example")
            .config("spark.master", "local")
            .getOrCreate();

        // read the Excel file - change the file name
        Dataset<Row> df = spark.read()
            .format("com.crealytics.spark.excel")
            .option("useHeader", "true")
            //.option("dataAddress", "'Sheet1'!A1:M1470") // optional: read only a cell range, where A1 is the top-left cell and M1470 the bottom-right cell of the sheet
            .load("datasets/test1.xlsx");

        // show your data
        df.show();
    }
}