Change schema of existing dataframe
I want to change the schema of an existing dataframe, but I get an error when I try. Is it possible to change the existing schema of a dataframe?
val customSchema = StructType(
  Array(
    StructField("data_typ", StringType, nullable = false),
    StructField("data_typ", IntegerType, nullable = false),
    StructField("proc_date", IntegerType, nullable = false),
    StructField("cyc_dt", DateType, nullable = false)
  ))
val readDF=
+------------+--------------------+-----------+--------------------+
|DatatypeCode| Description|monthColNam| timeStampColNam|
+------------+--------------------+-----------+--------------------+
| 03099|Volumetric/Expand...| 201867|2018-05-31 18:25:...|
| 03307| Elapsed Day Factor| 201867|2018-05-31 18:25:...|
+------------+--------------------+-----------+--------------------+
val rows= readDF.rdd
val readDF1 = sparkSession.createDataFrame(rows,customSchema)
Expected result:
val newdf=
+------------+--------------------+-----------+--------------------+
|data_typ_cd | data_typ_desc|proc_dt | cyc_dt |
+------------+--------------------+-----------+--------------------+
| 03099|Volumetric/Expand...| 201867|2018-05-31 18:25:...|
| 03307| Elapsed Day Factor| 201867|2018-05-31 18:25:...|
+------------+--------------------+-----------+--------------------+
Any help will be appreciated.
You cannot change the schema like this. The schema object passed to createDataFrame has to match the data, not the other way around. To change a column's type, use the cast method; see, for example, "how to change a Dataframe column from String type to Double type in pyspark".

You can do something like this to change the data type of a column from one type to another. I have created a dataframe similar to yours, as shown below:
import sparkSession.sqlContext.implicits._
import org.apache.spark.sql.types._
var df = Seq(("03099","Volumetric/Expand...", "201867", "2018-05-31 18:25:00"),("03307","Elapsed Day Factor", "201867", "2018-05-31 18:25:00"))
.toDF("DatatypeCode","data_typ", "proc_date", "cyc_dt")
df.printSchema()
df.show()
This gives me the following output:
root
|-- DatatypeCode: string (nullable = true)
|-- data_typ: string (nullable = true)
|-- proc_date: string (nullable = true)
|-- cyc_dt: string (nullable = true)
+------------+--------------------+---------+-------------------+
|DatatypeCode| data_typ|proc_date| cyc_dt|
+------------+--------------------+---------+-------------------+
| 03099|Volumetric/Expand...| 201867|2018-05-31 18:25:00|
| 03307| Elapsed Day Factor| 201867|2018-05-31 18:25:00|
+------------+--------------------+---------+-------------------+
As the schema above shows, all the columns are of type String. To change the column proc_date to Integer type and cyc_dt to Date type, I do the following:
df = df.withColumnRenamed("DatatypeCode", "data_type_code")
df = df.withColumn("proc_date_new", df("proc_date").cast(IntegerType)).drop("proc_date")
df = df.withColumn("cyc_dt_new", df("cyc_dt").cast(DateType)).drop("cyc_dt")
When you check the schema of this dataframe with
df.printSchema()
it gives the following output, with the new column names:
root
|-- data_type_code: string (nullable = true)
|-- data_typ: string (nullable = true)
|-- proc_date_new: integer (nullable = true)
|-- cyc_dt_new: date (nullable = true)
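The same rename-and-cast idea can also be written as a single select that yields the column names from the expected result in the question. This is only a minimal sketch: the object name RenameAndCast, the helper renameAndCast, and the local SparkSession are hypothetical, and it assumes that every column of readDF was read as a string. Note that cast does not make the columns non-nullable, so the result will not match the nullable=false flags in customSchema.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DateType, IntegerType}

object RenameAndCast {
  // Hypothetical helper: rename each source column and cast it where needed.
  def renameAndCast(readDF: DataFrame): DataFrame =
    readDF.select(
      col("DatatypeCode").as("data_typ_cd"),
      col("Description").as("data_typ_desc"),
      col("monthColNam").cast(IntegerType).as("proc_dt"),
      col("timeStampColNam").cast(DateType).as("cyc_dt")
    )

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the question uses `sparkSession`.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rename-and-cast")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for readDF from the question (assumed to hold only strings).
    val readDF = Seq(
      ("03099", "Volumetric/Expand...", "201867", "2018-05-31 18:25:00"),
      ("03307", "Elapsed Day Factor", "201867", "2018-05-31 18:25:00")
    ).toDF("DatatypeCode", "Description", "monthColNam", "timeStampColNam")

    val newDF = renameAndCast(readDF)
    newDF.printSchema()
    newDF.show()

    spark.stop()
  }
}
```

Doing the renames and casts in one select avoids the intermediate proc_date_new and cyc_dt_new columns and keeps the original column order.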