How to add a new column to an existing dataframe while also specifying its datatype?
I have a dataframe, yearDF, obtained from reading an RDBMS table on Postgres, which I need to ingest into a Hive table on HDFS.
val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
.option("dbtable", s"(${execQuery}) as year2017")
.option("user", devUserName)
.option("password", devPassword)
.option("numPartitions",10)
.load()
Before ingesting it, I have to add a new column, delete_flag, of datatype IntegerType. This column marks whether the row for a given primary key has been deleted in the source table. To add a new column to an existing dataframe, I know there is the option:
dataFrame.withColumn("del_flag", someOperation)
but there is no such option to specify the datatype of the new column.
I have written the StructType for the new column as:
val delFlagColumn = StructType(List(StructField("delete_flag", IntegerType, true)))
But I don't understand how to add this column to the existing dataframe yearDF. Could anyone let me know how to add a new column, along with its datatype, to an existing dataframe?
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.IntegerType

df.withColumn("a", lit("1").cast(IntegerType)).show()
Casting is not required if you pass lit(1), since Spark will infer an integer schema for you. But if you pass lit("1"), the column would be a string, so the cast is needed to convert it to Int.
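Applied to the question's scenario, here is a minimal runnable sketch. It uses a local SparkSession and toy data in place of the JDBC read, and defaults delete_flag to 0 (meaning "not deleted") — both of those are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.IntegerType

// Local session just for demonstration; in the question's setting,
// `yearDF` would come from the JDBC read shown above.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("addColumnDemo")
  .getOrCreate()
import spark.implicits._

// Toy stand-in for the Postgres table.
val yearDF = Seq((1, "a"), (2, "b")).toDF("id", "name")

// Add the new column with an explicit datatype. lit(0) alone would
// already infer IntegerType; the cast makes the intent explicit.
val withFlag = yearDF.withColumn("delete_flag", lit(0).cast(IntegerType))

withFlag.printSchema() // delete_flag appears as an integer column
```

Once the column exists with the right type, the dataframe's schema will line up with the Hive table's delete_flag column on write.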