
Apache Spark Dataset API - Does not accept schema StructType

I have the following class, which loads a headerless CSV file using the Spark Dataset API.

The problem is that I cannot get the SparkSession to accept a schema StructType that should define each column. The resulting DataFrame has unnamed columns of String type.

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

public class CsvReader implements java.io.Serializable {

    private StructType builder;

    public CsvReader(StructType builder) {
        this.builder = builder;
    }

    SparkConf conf = new SparkConf().setAppName("csvParquet").setMaster("local");
    // create Spark Context
    SparkContext context = new SparkContext(conf);
    // create Spark Session
    SparkSession sparkSession = new SparkSession(context);

    Dataset<Row> df = sparkSession
            .read()
            .format("com.databricks.spark.csv")
            .option("header", false)
            //.option("inferSchema", true)
            .schema(builder)
            .load("/Users/Chris/Desktop/Meter_Geocode_Data.csv"); //TODO: CMD line arg

    public void printSchema() {
        System.out.println(builder.length());
        df.printSchema();
    }

    public void printData() {
        df.show();
    }

    public void printMeters() {
        df.select("meter").show();
    }

    public void printMeterCountByGeocode_result() {
        df.groupBy("geocode_result").count().show();
    }

    public Dataset<Row> getDataframe() {
        return df;
    }
}

The resulting DataFrame schema is:

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)

The debugger shows that the 'builder' StructType is correctly defined:

0 = {StructField@4904} "StructField(geocode_result,DoubleType,false)"
1 = {StructField@4905} "StructField(meter,StringType,false)"
2 = {StructField@4906} "StructField(orig_easting,StringType,false)"
3 = {StructField@4907} "StructField(orig_northing,StringType,false)"
4 = {StructField@4908} "StructField(temetra_easting,StringType,false)"
5 = {StructField@4909} "StructField(temetra_northing,StringType,false)"
6 = {StructField@4910} "StructField(orig_address,StringType,false)"
7 = {StructField@4911} "StructField(orig_postcode,StringType,false)"
8 = {StructField@4912} "StructField(postcode_easting,StringType,false)"
9 = {StructField@4913} "StructField(postcode_northing,StringType,false)"
10 = {StructField@4914} "StructField(distance_calc_method,StringType,false)"
11 = {StructField@4915} "StructField(distance,StringType,false)"
12 = {StructField@4916} "StructField(geocoded_address,StringType,false)"
13 = {StructField@4917} "StructField(geocoded_postcode,StringType,false)"
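
For context, the schema is built along these lines (a minimal sketch using Spark's DataTypes factory methods; only the first few fields are shown, the rest follow the same pattern):

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// nullable = false on every field, matching the debugger output above
StructType builder = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("geocode_result", DataTypes.DoubleType, false),
        DataTypes.createStructField("meter", DataTypes.StringType, false),
        DataTypes.createStructField("orig_easting", DataTypes.StringType, false),
        // ... remaining StringType fields as listed above
});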

What am I doing wrong? Any help massively appreciated!

Declare the variable Dataset<Row> df as a plain field and move the code block that reads the CSV file inside the getDataframe() method, like below. In your class, the df field initializer runs before the constructor body assigns builder, so the schema passed to .schema() is still null when the CSV is read; deferring the read until getDataframe() is called avoids this.

private Dataset<Row> df = null;

public Dataset<Row> getDataframe() {
    df = sparkSession
            .read()
            .format("com.databricks.spark.csv")
            .option("header", false)
            //.option("inferSchema", true)
            .schema(builder)
            .load("src/main/java/resources/test.csv"); //TODO: CMD line arg
    return df;
}

Now you can call it like below.

CsvReader cr = new CsvReader(schema);
Dataset<Row> df = cr.getDataframe();
cr.printSchema();

I would suggest redesigning your class. One option is to pass df to the other methods as a parameter. Also, if you are using Spark 2.0 then you don't need SparkConf; please refer to the documentation for creating a SparkSession.
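
For example, with Spark 2.0+ the session can be created directly through the builder, without a SparkConf (a minimal sketch; the app name and master are placeholders taken from the question):

import org.apache.spark.sql.SparkSession;

SparkSession sparkSession = SparkSession.builder()
        .appName("csvParquet")
        .master("local")
        .getOrCreate();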

If you want to initialize df using the builder, you should put it in the constructor. Alternatively, you can put it in a member function.
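
A minimal sketch of the constructor-based variant, assuming df is declared as a plain field (private Dataset<Row> df;) and the CSV path from the question is kept as a placeholder:

public CsvReader(StructType builder) {
    this.builder = builder;
    // builder is assigned before the read, so the schema is actually applied;
    // the sparkSession field initializer has already run at this point
    this.df = sparkSession
            .read()
            .format("com.databricks.spark.csv")
            .option("header", false)
            .schema(builder)
            .load("/Users/Chris/Desktop/Meter_Geocode_Data.csv"); //TODO: CMD line arg
}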
