
Apache Spark Dataset API - Does not accept schema StructType

I have the following class, which loads a headerless CSV file using the Spark Dataset API.

The problem is that I cannot get the SparkSession to accept a schema StructType that should define each column. The resulting DataFrame has unnamed columns of String type.

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

public class CsvReader implements java.io.Serializable {

    private StructType builder;

    public CsvReader(StructType builder) {
        this.builder = builder;
    }

    SparkConf conf = new SparkConf().setAppName("csvParquet").setMaster("local");
    // create Spark Context
    SparkContext context = new SparkContext(conf);
    // create Spark Session
    SparkSession sparkSession = new SparkSession(context);

    Dataset<Row> df = sparkSession
            .read()
            .format("com.databricks.spark.csv")
            .option("header", false)
            //.option("inferSchema", true)
            .schema(builder)
            .load("/Users/Chris/Desktop/Meter_Geocode_Data.csv"); //TODO: CMD line arg

    public void printSchema() {
        System.out.println(builder.length());
        df.printSchema();
    }

    public void printData() {
        df.show();
    }

    public void printMeters() {
        df.select("meter").show();
    }

    public void printMeterCountByGeocode_result() {
        df.groupBy("geocode_result").count().show();
    }

    public Dataset<Row> getDataframe() {
        return df;
    }
}

The resulting DataFrame schema is:

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)

The debugger shows that the 'builder' StructType is correctly defined:

0 = {StructField@4904} "StructField(geocode_result,DoubleType,false)"
1 = {StructField@4905} "StructField(meter,StringType,false)"
2 = {StructField@4906} "StructField(orig_easting,StringType,false)"
3 = {StructField@4907} "StructField(orig_northing,StringType,false)"
4 = {StructField@4908} "StructField(temetra_easting,StringType,false)"
5 = {StructField@4909} "StructField(temetra_northing,StringType,false)"
6 = {StructField@4910} "StructField(orig_address,StringType,false)"
7 = {StructField@4911} "StructField(orig_postcode,StringType,false)"
8 = {StructField@4912} "StructField(postcode_easting,StringType,false)"
9 = {StructField@4913} "StructField(postcode_northing,StringType,false)"
10 = {StructField@4914} "StructField(distance_calc_method,StringType,false)"
11 = {StructField@4915} "StructField(distance,StringType,false)"
12 = {StructField@4916} "StructField(geocoded_address,StringType,false)"
13 = {StructField@4917} "StructField(geocoded_postcode,StringType,false)"
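
For context, the schema is built along these lines (a minimal sketch using Spark's DataTypes factory methods; only the first few fields are shown, the rest follow the same pattern):

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// nullable = false on every field, matching the debugger output above
StructType builder = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("geocode_result", DataTypes.DoubleType, false),
        DataTypes.createStructField("meter", DataTypes.StringType, false),
        DataTypes.createStructField("orig_easting", DataTypes.StringType, false),
        // ... remaining StringType fields as listed above
});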

What am I doing wrong? Any help massively appreciated!

Declare the variable Dataset<Row> df as a plain field and move the code block that reads the CSV file inside the getDataframe() method, like below. In your class, the df field initializer runs before the constructor body assigns builder, so the schema passed to .schema() is still null when the CSV is read; deferring the read until getDataframe() is called avoids this.

private Dataset<Row> df = null;

public Dataset<Row> getDataframe() {
    df = sparkSession
            .read()
            .format("com.databricks.spark.csv")
            .option("header", false)
            //.option("inferSchema", true)
            .schema(builder)
            .load("src/main/java/resources/test.csv"); //TODO: CMD line arg
    return df;
}

Now you can call it like below.

CsvReader cr = new CsvReader(schema);
Dataset<Row> df = cr.getDataframe();
cr.printSchema();

I would suggest redesigning your class. One option is to pass df to the other methods as a parameter. Also, if you are using Spark 2.0 then you don't need SparkConf; please refer to the documentation for creating a SparkSession.
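
For example, with Spark 2.0+ the session can be created directly through the builder, without a SparkConf (a minimal sketch; the app name and master are placeholders taken from the question):

import org.apache.spark.sql.SparkSession;

SparkSession sparkSession = SparkSession.builder()
        .appName("csvParquet")
        .master("local")
        .getOrCreate();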

If you want to initialize df using the builder, you should put it in the constructor. Alternatively, you can put it in a member function.
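
A minimal sketch of the constructor-based variant, assuming df is declared as a plain field (private Dataset<Row> df;) and the CSV path from the question is kept as a placeholder:

public CsvReader(StructType builder) {
    this.builder = builder;
    // builder is assigned before the read, so the schema is actually applied;
    // the sparkSession field initializer has already run at this point
    this.df = sparkSession
            .read()
            .format("com.databricks.spark.csv")
            .option("header", false)
            .schema(builder)
            .load("/Users/Chris/Desktop/Meter_Geocode_Data.csv"); //TODO: CMD line arg
}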
