Changing the schema of an existing DataFrame

Question

I want to change the schema of an existing DataFrame, but I get an error when I try. Is it possible to change the existing schema of a DataFrame?

val customSchema = StructType(
      Array(
        StructField("data_typ", StringType, nullable = false),
        StructField("data_typ", IntegerType, nullable = false),
        StructField("proc_date", IntegerType, nullable = false),
        StructField("cyc_dt", DateType, nullable = false)
      ))

val readDF=
+------------+--------------------+-----------+--------------------+
|DatatypeCode|         Description|monthColNam|     timeStampColNam|
+------------+--------------------+-----------+--------------------+
|       03099|Volumetric/Expand...|     201867|2018-05-31 18:25:...|
|       03307|  Elapsed Day Factor|     201867|2018-05-31 18:25:...|
+------------+--------------------+-----------+--------------------+

val rows= readDF.rdd
val readDF1 = sparkSession.createDataFrame(rows,customSchema)

Expected result:

val newdf=
    +------------+--------------------+-----------+--------------------+
    |data_typ_cd |       data_typ_desc|proc_dt    |     cyc_dt         |
    +------------+--------------------+-----------+--------------------+
    |       03099|Volumetric/Expand...|     201867|2018-05-31 18:25:...|
    |       03307|  Elapsed Day Factor|     201867|2018-05-31 18:25:...|
    +------------+--------------------+-----------+--------------------+

Any help will be appreciated.

Answer 1

You cannot change the schema like this. The schema object passed to createDataFrame has to match the data, not the other way around:

To parse timestamp data, use the corresponding functions, for example as described in "Better way to convert a string field into timestamp in Spark".
To change other types, use the cast method, for example as described in "How to change a DataFrame column from String type to Double type in PySpark".
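
The two points above can be sketched in a single select that renames and converts each column. This is only an illustration under assumptions: the question does not show the actual column types of readDF, so the sample below assumes all four columns are strings, and the timestamp pattern "yyyy-MM-dd HH:mm:ss" is a guess based on the truncated values in the question. It is written as a script (for example for spark-shell) and needs Spark 2.2 or later for to_timestamp.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}
import org.apache.spark.sql.types.IntegerType

// In spark-shell this returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("cast-sketch")
  .getOrCreate()
import spark.implicits._

// Stand-in for the question's readDF; all-string columns are an assumption.
val readDF = Seq(
  ("03099", "Volumetric/Expand...", "201867", "2018-05-31 18:25:00"),
  ("03307", "Elapsed Day Factor", "201867", "2018-05-31 18:25:00")
).toDF("DatatypeCode", "Description", "monthColNam", "timeStampColNam")

// Rename and convert column by column instead of forcing a new schema on the rows.
val newDF = readDF.select(
  col("DatatypeCode").as("data_typ_cd"),
  col("Description").as("data_typ_desc"),
  col("monthColNam").cast(IntegerType).as("proc_dt"),
  to_timestamp(col("timeStampColNam"), "yyyy-MM-dd HH:mm:ss").as("cyc_dt")
)

newDF.printSchema()
```

If a date rather than a timestamp is wanted for cyc_dt, the parsed column can be cast to DateType afterwards.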

Answer 2

You can do something like this to change a column's data type from one type to another.

I created a DataFrame similar to yours, as shown below:

import sparkSession.sqlContext.implicits._
import org.apache.spark.sql.types._

var df = Seq(("03099","Volumetric/Expand...", "201867", "2018-05-31 18:25:00"),("03307","Elapsed Day Factor", "201867", "2018-05-31 18:25:00"))
  .toDF("DatatypeCode","data_typ", "proc_date", "cyc_dt")

df.printSchema()
df.show()

This gives me the following output:

root
 |-- DatatypeCode: string (nullable = true)
 |-- data_typ: string (nullable = true)
 |-- proc_date: string (nullable = true)
 |-- cyc_dt: string (nullable = true)

+------------+--------------------+---------+-------------------+
|DatatypeCode|            data_typ|proc_date|             cyc_dt|
+------------+--------------------+---------+-------------------+
|       03099|Volumetric/Expand...|   201867|2018-05-31 18:25:00|
|       03307|  Elapsed Day Factor|   201867|2018-05-31 18:25:00|
+------------+--------------------+---------+-------------------+

As the schema above shows, all the columns are of type String. To rename DatatypeCode, change proc_date to Integer type, and change cyc_dt to Date type, I do the following:

df = df.withColumnRenamed("DatatypeCode", "data_type_code")

df = df.withColumn("proc_date_new", df("proc_date").cast(IntegerType)).drop("proc_date")

df = df.withColumn("cyc_dt_new", df("cyc_dt").cast(DateType)).drop("cyc_dt")

When you then check the schema of this DataFrame

df.printSchema()

it gives the following output with the new column names:

root
 |-- data_type_code: string (nullable = true)
 |-- data_typ: string (nullable = true)
 |-- proc_date_new: integer (nullable = true)
 |-- cyc_dt_new: date (nullable = true)
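
If the original column names should be kept, a possible variation on the steps above is to pass an existing column name to withColumn, which replaces that column instead of adding a new one. The sketch below rebuilds the same sample DataFrame and chains the calls so that no var or drop is needed; the session setup and the name converted are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DateType, IntegerType}

// In spark-shell this returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("in-place-cast-sketch")
  .getOrCreate()
import spark.implicits._

// Same sample data as in the answer above.
val df = Seq(
  ("03099", "Volumetric/Expand...", "201867", "2018-05-31 18:25:00"),
  ("03307", "Elapsed Day Factor", "201867", "2018-05-31 18:25:00")
).toDF("DatatypeCode", "data_typ", "proc_date", "cyc_dt")

// withColumn with an existing name overwrites that column,
// so the names and the column order stay the same.
val converted = df
  .withColumnRenamed("DatatypeCode", "data_type_code")
  .withColumn("proc_date", col("proc_date").cast(IntegerType))
  .withColumn("cyc_dt", col("cyc_dt").cast(DateType))

converted.printSchema()
```

Note that cast returns null for values it cannot convert, so it is worth checking the result for unexpected nulls.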
