
working with arraytype in spark Dataframe

My requirement is to cast all Decimal data types in a DataFrame to String. The logic works fine with simple types but not with ArrayType. Here is the logic:

var df = spark.sql("select * from test_1")
for(dt <- df.dtypes) {
  if(dt._2.substring(0,7) == "Decimal"){
    df = df.withColumn(dt._1,df(dt._1).cast("String"))  
  }
}

But the columns within the ArrayType remain unchanged, although they are of decimal type. Please help: how can I loop through the nested elements and cast them to string? This is the schema of my dataframe:

scala> df.schema
res77: org.apache.spark.sql.types.StructType = StructType(
  StructField(mstr_prov_id,StringType,true),
  StructField(prov_ctgry_cd,StringType,true),
  StructField(prov_orgnl_efctv_dt,TimestampType,true),
  StructField(prov_trmntn_dt,TimestampType,true),
  StructField(prov_trmntn_rsn_cd,StringType,true),
  StructField(npi_rqrd_ind,StringType,true),
  StructField(prov_stts_aray_txt,ArrayType(StructType(
    StructField(PROV_STTS_KEY,DecimalType(22,0),true),
    StructField(PROV_STTS_EFCTV_DT,TimestampType,true),
    StructField(PROV_STTS_CD,StringType,true),
    StructField(PROV_STTS_TRMNTN_DT,TimestampType,true),
    StructField(PROV_STTS_TRMNTN_RSN_CD,StringType,true)),true),true))

You can also cast complex types. For example, if you have a dataframe with this schema:

root
 |-- arr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- i: decimal(22,0) (nullable = true)
 |    |    |-- j: double (nullable = false)

you can cast all array elements of type decimal (field i in this example) by doing:

import spark.implicits._  // for the $"colName" column syntax

df
  .select($"arr".cast("array<struct<i:string,j:double>>"))
  .printSchema()

root
 |-- arr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- i: string (nullable = true)
 |    |    |-- j: double (nullable = true)
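
Applied to the schema from the question, the same technique would look like the following (a minimal sketch; it assumes Spark 2.1+, since casting structs that contain timestamp fields fails on older versions, as noted further down):

// Cast the nested decimal PROV_STTS_KEY to string by casting the whole
// array column to the same struct layout, with string substituted for
// decimal(22,0); all other fields keep their original types.
val dfCast = df.withColumn(
  "prov_stts_aray_txt",
  df("prov_stts_aray_txt").cast(
    "array<struct<PROV_STTS_KEY:string,PROV_STTS_EFCTV_DT:timestamp," +
      "PROV_STTS_CD:string,PROV_STTS_TRMNTN_DT:timestamp," +
      "PROV_STTS_TRMNTN_RSN_CD:string>>"))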

EDIT: If you don't know the schema in advance, you can just replace the decimal in the original schema with string:

val arraySchema = df.schema.fields(0).dataType.simpleString
// \\d+,\\d+ keeps the match anchored to a single decimal(precision,scale);
// a greedy .* could swallow everything up to the last ")" in the signature
val castedSchema = arraySchema.replaceAll("decimal\\(\\d+,\\d+\\)", "string")

df
  .select($"arr".cast(castedSchema))
  .show()
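
The snippet above only rewrites the first column (fields(0)). As a generalization (my sketch, not part of the original answer), the same simpleString-plus-replaceAll trick can be applied to every column, covering top-level and nested decimals in one pass:

import org.apache.spark.sql.functions.col

// For each column, take its type signature, swap decimal(p,s) for string,
// and cast; columns with no decimals are cast to their own type (a no-op).
val castedCols = df.schema.fields.map { f =>
  val newType = f.dataType.simpleString.replaceAll("decimal\\(\\d+,\\d+\\)", "string")
  col(f.name).cast(newType).as(f.name)
}
val dfNoDecimals = df.select(castedCols: _*)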

If you are using Spark 2.1 or above, then the following casting should work for you:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, StructType}

val newSchema = DataType.fromJson(df.schema.json.replaceAll("(decimal\\(\\d+,\\d+\\))", "string")).asInstanceOf[StructType]
df.select(newSchema.map(field => col(field.name).cast(field.dataType)): _*)

which should cast all the decimal types to string type.
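
If you want to avoid regex-on-JSON entirely, an alternative sketch (my addition, not from the original answers; it assumes Spark 2.1+ for the same timestamp-in-struct reason discussed below) rewrites the schema recursively, turning every DecimalType into StringType at any nesting depth:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Walk the type tree: decimals become strings; arrays, structs and maps
// are rebuilt with their element/field types rewritten recursively.
def decimalToString(dt: DataType): DataType = dt match {
  case _: DecimalType     => StringType
  case ArrayType(et, n)   => ArrayType(decimalToString(et), n)
  case StructType(fields) => StructType(fields.map(f => f.copy(dataType = decimalToString(f.dataType))))
  case MapType(k, v, n)   => MapType(decimalToString(k), decimalToString(v), n)
  case other              => other
}

val targetSchema = StructType(df.schema.fields.map(f => f.copy(dataType = decimalToString(f.dataType))))
df.select(targetSchema.map(f => col(f.name).cast(f.dataType)): _*)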

But if you are using a Spark version lower than that, then since there is a timestamp data type in the struct column, you will encounter

scala.MatchError: TimestampType (of class org.apache.spark.sql.types.TimestampType$)

It's a known bug ("casting structs fails on Timestamp fields"), which was resolved later ("cast struct with timestamp field fails").

Try this (your comparison with == is probably not what you want):

var df = spark.sql("select * from test_1")
for (dt <- df.dtypes) {
  // Math.min guards against type names shorter than 7 characters, which
  // would make substring(0,7) throw StringIndexOutOfBoundsException
  if ("Decimal".equalsIgnoreCase(dt._2.substring(0, Math.min(7, dt._2.length)))) {
    df = df.withColumn(dt._1, df(dt._1).cast("String"))
  }
}
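
Note that this corrected loop still only touches top-level decimal columns; decimals nested inside the ArrayType column still require one of the complex-type casts shown above.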

