How to add a new column to an existing dataframe while also specifying its datatype?
I have a dataframe, yearDF, obtained from reading an RDBMS table on Postgres, which I need to ingest into a Hive table on HDFS.
val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
.option("dbtable", s"(${execQuery}) as year2017")
.option("user", devUserName)
.option("password", devPassword)
.option("numPartitions",10)
.load()
Before ingesting it, I have to add a new column, delete_flag, of datatype IntegerType. This column marks whether the row for a given primary key has been deleted in the source table. To add a new column to an existing dataframe, I know there is the option:
dataFrame.withColumn("del_flag", someOperation)
but there is no such option to specify the datatype of the new column.
I have written the StructType for the new column as:
val delFlagColumn = StructType(List(StructField("delete_flag", IntegerType, true)))
But I don't understand how to add this column to the existing dataframe yearDF. Could anyone let me know how to add a new column, along with its datatype, to an existing dataframe?
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.IntegerType

df.withColumn("a", lit("1").cast(IntegerType)).show()
Casting is not required if you pass lit(1), since Spark will infer an integer schema for you. But if you pass lit("1"), the column would be a string, so the cast is needed to convert it to Int.
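Applied to the question's scenario, here is a minimal runnable sketch. It uses a local SparkSession and toy data in place of the JDBC read, and defaults delete_flag to 0 (meaning "not deleted") — both of those are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.IntegerType

// Local session just for demonstration; in the question's setting,
// `yearDF` would come from the JDBC read shown above.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("addColumnDemo")
  .getOrCreate()
import spark.implicits._

// Toy stand-in for the Postgres table.
val yearDF = Seq((1, "a"), (2, "b")).toDF("id", "name")

// Add the new column with an explicit datatype. lit(0) alone would
// already infer IntegerType; the cast makes the intent explicit.
val withFlag = yearDF.withColumn("delete_flag", lit(0).cast(IntegerType))

withFlag.printSchema() // delete_flag appears as an integer column
```

Once the column exists with the right type, the dataframe's schema will line up with the Hive table's delete_flag column on write.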