
Spark Dataframe datatype as String

I am trying to verify the data types of a DataFrame by running Describe as a SQL query, but every time the datetime column comes back as string.

1. First I tried the below code:

    SparkSession sparkSession = new SparkSession.Builder().getOrCreate();
    Dataset<Row> df = sparkSession.read().option("header", "true").option("inferschema", "true").format("csv").load("/user/data/*_ecs.csv");

    try {
        df.createTempView("data");
        Dataset<Row> sqlDf = sparkSession.sql("Describe data");
        sqlDf.show(300, false);

    Output:
    +-----------------+---------+-------+
    |col_name         |data_type|comment|
    +-----------------+---------+-------+
    |id               |int      |null   |
    |symbol           |string   |null   |
    |datetime         |string   |null   |
    |side             |string   |null   |
    |orderQty         |int      |null   |
    |price            |double   |null   | 
    +-----------------+---------+-------+
  2. I also tried a custom schema, but in that case I get an exception whenever I run any query other than describe table:

     SparkSession sparkSession = new SparkSession.Builder().getOrCreate();
     Dataset<Row> df = sparkSession.read().option("header", "true").schema(customeSchema).format("csv").load("/use/data/*_ecs.csv");

     try {
         df.createTempView("trade_data");
         Dataset<Row> sqlDf = sparkSession.sql("Describe trade_data");
         sqlDf.show(300, false);

     Output:
     +--------+---------+-------+
     |col_name|data_type|comment|
     +--------+---------+-------+
     |datetime|timestamp|null   |
     |price   |double   |null   |
     |orderQty|double   |null   |
     +--------+---------+-------+

But if I try any query, I get the below exception:

Dataset<Row> sqlDf=sparkSession.sql("select DATE(datetime),avg(price),avg(orderQty) from data group by datetime");


java.lang.IllegalArgumentException
        at java.sql.Date.valueOf(Date.java:143)
        at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
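
The stack trace bottoms out in java.sql.Date.valueOf, which only accepts strings in the JDBC date-escape form yyyy-[m]m-[d]d; a value carrying a time-of-day (as a CSV datetime cell typically does) is rejected with exactly this IllegalArgumentException. A JDK-only sketch of that behavior (the sample strings are made up for illustration, not taken from the question's data):

```java
import java.sql.Date;

public class DateValueOfDemo {
    public static void main(String[] args) {
        // A clean yyyy-[m]m-[d]d string parses fine.
        System.out.println(Date.valueOf("2018-05-06"));

        // Anything with a time-of-day attached does not match the
        // date-escape format, so valueOf throws IllegalArgumentException --
        // the same exception surfacing in the Spark stack trace above.
        try {
            Date.valueOf("2018-05-06 14:32:00"); // hypothetical CSV cell
            System.out.println("parsed");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: IllegalArgumentException");
        }
    }
}
```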

How can this be solved?

  1. Why is inferSchema not working?

  2. If you don't want to submit your own schema, one way is:

     Dataset<Row> df = sparkSession.read().format("csv").option("header", "true").option("inferschema", "true").load("example.csv");
     df.printSchema(); // check output - 1
     df.createOrReplaceTempView("df");
     Dataset<Row> df1 = sparkSession.sql("select * , Date(datetime) as datetime_d from df").drop("datetime");
     df1.printSchema(); // check output - 2

     output - 1:
     root
      |-- id: integer (nullable = true)
      |-- symbol: string (nullable = true)
      |-- datetime: string (nullable = true)
      |-- side: string (nullable = true)
      |-- orderQty: integer (nullable = true)
      |-- price: double (nullable = true)

     output - 2:
     root
      |-- id: integer (nullable = true)
      |-- symbol: string (nullable = true)
      |-- side: string (nullable = true)
      |-- orderQty: integer (nullable = true)
      |-- price: double (nullable = true)
      |-- datetime_d: date (nullable = true)

    I would go with this approach if the number of fields to cast is not too large.

  3. If you want to submit your own schema:

     List<org.apache.spark.sql.types.StructField> fields = new ArrayList<>();
     fields.add(DataTypes.createStructField("datetime", DataTypes.TimestampType, true));
     fields.add(DataTypes.createStructField("price", DataTypes.DoubleType, true));
     fields.add(DataTypes.createStructField("orderQty", DataTypes.DoubleType, true));
     StructType schema = DataTypes.createStructType(fields);

     Dataset<Row> df = sparkSession.read().format("csv").option("header", "true").schema(schema).load("example.csv");
     df.printSchema(); // output - 1
     df.createOrReplaceTempView("df");
     Dataset<Row> df1 = sparkSession.sql("select * , Date(datetime) as datetime_d from df").drop("datetime");
     df1.printSchema(); // output - 2

     output - 1:
     root
      |-- datetime: timestamp (nullable = true)
      |-- price: double (nullable = true)
      |-- orderQty: double (nullable = true)

     output - 2:
     root
      |-- price: double (nullable = true)
      |-- orderQty: double (nullable = true)
      |-- datetime_d: date (nullable = true)

    I don't see much use in this method, since the value is converted from timestamp to date all over again; still, leaving it here in case it is useful later.
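
On point 1 above, a likely explanation (an assumption based on how Spark 2.x's CSV reader infers types, not something stated in the thread): with inferschema enabled, a column is only promoted to timestamp when its values parse with the reader's timestampFormat option, whose 2.x default is the ISO-like pattern yyyy-MM-dd'T'HH:mm:ss.SSSXXX. A plain yyyy-MM-dd HH:mm:ss value does not match that pattern, so the column falls back to string; passing .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") to the reader should fix inference. The mismatch itself can be reproduced with plain java.time (the sample value and patterns are illustrative):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class TimestampFormatDemo {
    public static void main(String[] args) {
        String sample = "2018-05-06 14:32:00"; // hypothetical CSV cell

        // The ISO-like default pattern (with 'T' and a zone offset) that
        // Spark 2.x's CSV reader tries during schema inference.
        DateTimeFormatter sparkDefault =
                DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX");
        try {
            LocalDateTime.parse(sample, sparkDefault);
            System.out.println("default pattern: parsed");
        } catch (DateTimeParseException e) {
            // No match -> inference leaves the column as string.
            System.out.println("default pattern: no match, column stays string");
        }

        // A pattern matching the actual data would let inference succeed.
        DateTimeFormatter explicit = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
        System.out.println("explicit pattern: " + LocalDateTime.parse(sample, explicit));
    }
}
```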
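
If the double hop (string to timestamp, then timestamp to date) is the objection to approach 3, one variation is to declare the column as DateType in the schema up front, so the reader itself yields dates and no SQL-side Date() cast is needed. This is only a sketch under assumptions (the dateFormat value and file name are placeholders; whether the raw strings actually parse depends on the real file):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class DateTypeSchemaSketch {
    public static void main(String[] args) {
        // Declare datetime as DateType directly, instead of TimestampType.
        List<StructField> fields = new ArrayList<>();
        fields.add(DataTypes.createStructField("datetime", DataTypes.DateType, true));
        fields.add(DataTypes.createStructField("price", DataTypes.DoubleType, true));
        fields.add(DataTypes.createStructField("orderQty", DataTypes.DoubleType, true));
        StructType schema = DataTypes.createStructType(fields);

        SparkSession spark = new SparkSession.Builder().getOrCreate();
        Dataset<Row> df = spark.read()
                .format("csv")
                .option("header", "true")
                // assumed input format -- adjust to what the file really contains
                .option("dateFormat", "yyyy-MM-dd HH:mm:ss")
                .schema(schema)
                .load("example.csv");
        df.printSchema(); // datetime should now appear as: date
    }
}
```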
