Working with ArrayType in a Spark DataFrame
My requirement is to cast every Decimal column in a DataFrame to String. The logic works fine with simple types but not with ArrayType. Here is the logic:
var df = spark.sql("select * from test_1")
for(dt <- df.dtypes) {
if(dt._2.substring(0,7) == "Decimal"){
df = df.withColumn(dt._1,df(dt._1).cast("String"))
}
}
But the columns inside the ArrayType remain unchanged even though they are of decimal type. Please help: how can I loop through the nested elements and cast them to string? This is the schema of my DataFrame:
scala> df.schema
res77: org.apache.spark.sql.types.StructType = StructType(StructField(mstr_prov_id,StringType,true), StructField(prov_ctgry_cd,StringType,true), StructField(prov_orgnl_efctv_dt,TimestampType,true), StructField(prov_trmntn_dt,TimestampType,true), StructField(prov_trmntn_rsn_cd,StringType,true), StructField(npi_rqrd_ind,StringType,true), StructField(prov_stts_aray_txt,ArrayType(StructType(StructField(PROV_STTS_KEY,DecimalType(22,0),true), StructField(PROV_STTS_EFCTV_DT,TimestampType,true), StructField(PROV_STTS_CD,StringType,true), StructField(PROV_STTS_TRMNTN_DT,TimestampType,true), StructField(PROV_STTS_TRMNTN_RSN_CD,StringType,true)),true),true))
You can also cast complex types. For example, if you have a DataFrame with this schema:
root
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- i: decimal(22,0) (nullable = true)
| | |-- j: double (nullable = false)
you can cast all array elements of type decimal (field i in this example) by doing:
df
.select($"arr".cast("array<struct<i:string,j:double>>"))
.printSchema()
root
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- i: string (nullable = true)
| | |-- j: double (nullable = true)
EDIT: If you don't know the schema in advance, you can replace decimal in the original schema with string:
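// Assumes the array column is the first field of the schema.
val arraySchema = df.schema.fields(0).dataType.simpleString
// Match each decimal(p,s) on its own; a greedy ".*" over-matches when there are several decimal fields.
val castedSchema = arraySchema.replaceAll("decimal\\(\\d+,\\d+\\)", "string")
df
.select($"arr".cast(castedSchema))
.show()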
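The choice of regex matters once a struct has more than one decimal field. The sketch below works on a plain string, so no Spark session is needed. The type string is a hypothetical example written in the `simpleString` format, and the field `k` is not part of the original schema. It compares a greedy pattern with a digit-bounded one:

```scala
object DecimalRegexDemo {
  // Hypothetical type string with two decimal fields (field k is made up for illustration).
  val arraySchema = "array<struct<i:decimal(22,0),j:double,k:decimal(10,2)>>"

  // Greedy: ".*" runs from the first "(" to the last ")", swallowing fields j and k.
  val greedy: String = arraySchema.replaceAll("decimal\\(.*\\)", "string")

  // Digit-bounded: each decimal(p,s) is matched and replaced on its own.
  val bounded: String = arraySchema.replaceAll("decimal\\(\\d+,\\d+\\)", "string")

  def main(args: Array[String]): Unit = {
    println(greedy)  // array<struct<i:string>>
    println(bounded) // array<struct<i:string,j:double,k:string>>
  }
}
```

With a single decimal field both patterns give the same result, which is why the greedy form works on the example schema.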
If you are using Spark 2.1 or above, the following cast should work for you:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, StructType}
val newSchema = DataType.fromJson(df.schema.json.replaceAll("(decimal\\(\\d+,\\d+\\))", "string")).asInstanceOf[StructType]
df.select(newSchema.map(field => col(field.name).cast(field.dataType)): _*)
This should cast all the decimal types to string type, including those nested inside arrays and structs.
But if you are using a Spark version lower than that, then because there is a timestamp datatype in the struct column, you will encounter
scala.MatchError: TimestampType (of class org.apache.spark.sql.types.TimestampType$)
This is the issue reported as "casting structs fails on Timestamp fields" and resolved under "cast struct with timestamp field fails".
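An alternative to string replacement is to walk the schema and rebuild it type by type. The following is a minimal sketch, not a definitive implementation: the object name `DecimalToString` and the helpers `rewrite` and `castDecimals` are made up for illustration, and it assumes a Spark version in which struct-to-struct casts work (see the caveat above).

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

object DecimalToString {
  // Recursively replace every DecimalType with StringType,
  // descending into arrays, maps and structs; other types are left as-is.
  def rewrite(dt: DataType): DataType = dt match {
    case _: DecimalType => StringType
    case ArrayType(elem, containsNull) => ArrayType(rewrite(elem), containsNull)
    case MapType(k, v, valueContainsNull) => MapType(rewrite(k), rewrite(v), valueContainsNull)
    case StructType(fields) =>
      StructType(fields.map(f => f.copy(dataType = rewrite(f.dataType))))
    case other => other
  }

  // Cast each top-level column to its rewritten type, keeping the column name.
  def castDecimals(df: DataFrame): DataFrame =
    df.select(df.schema.fields.map(f => col(f.name).cast(rewrite(f.dataType)).as(f.name)): _*)
}
```

Usage would be `DecimalToString.castDecimals(df).printSchema()`. Compared with the regex approach, this does not depend on how type names are rendered as text.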
Try this (your comparison with == is probably not what you want):
var df = spark.sql("select * from test_1")
for(dt <- df.dtypes) {
if("Decimal".equalsIgnoreCase(dt._2.substring(0,Math.min(7, dt._2.length)))){
df = df.withColumn(dt._1,df(dt._1).cast("String"))
}
}
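In Scala, `==` on strings compares contents, so the prefix test behaves much the same in both versions of the loop. Either way it only inspects the top-level type name of each column. The sketch below uses plain strings written in the format `df.dtypes` returns. The column names `key` and `name` are hypothetical and the array entry is shortened. It shows that an array column never matches the `Decimal` prefix:

```scala
object PrefixCheckDemo {
  // (column name, type name) pairs shaped like the output of df.dtypes.
  val dtypes: Seq[(String, String)] = Seq(
    "key"  -> "DecimalType(22,0)",
    "name" -> "StringType",
    "arr"  -> "ArrayType(StructType(StructField(PROV_STTS_KEY,DecimalType(22,0),true)),true)"
  )

  // startsWith needs no bounds handling, unlike substring(0, 7).
  val matched: Seq[String] =
    dtypes.collect { case (name, tpe) if tpe.startsWith("Decimal") => name }

  def main(args: Array[String]): Unit =
    println(matched) // List(key)
}
```

The nested decimal inside `arr` is skipped because its type name starts with `ArrayType`, so a loop over `df.dtypes` alone cannot reach it.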