
Casting DataFrame columns with validation in Spark

I need to cast the columns of a DataFrame whose values are all strings to the data types defined in a schema. While casting, the corrupt records (records whose values have the wrong data type) need to go into a separate column.

Example DataFrame:

+---+----------+-----+
|id |name      |class|
+---+----------+-----+
|1  |abc       |21   |
|2  |bca       |32   |
|3  |abab      | 4   |
|4  |baba      |5a   |
|5  |cccca     |     |
+---+----------+-----+

JSON Schema of the file:

{
  "definitions": {},
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http://example.com/root.json",
  "type": ["object", "null"],
  "required": ["id", "name", "class"],
  "properties": {
    "id": {"$id": "#/properties/id", "type": ["integer", "null"]},
    "name": {"$id": "#/properties/name", "type": ["string", "null"]},
    "class": {"$id": "#/properties/class", "type": ["integer", "null"]}
  }
}

Here row 4 is a corrupt record, because the class column is of type integer and the value 5a is not an integer. Only this record should end up in the corrupt records, not the 5th row.

Just check whether the value is NOT NULL before casting and NULL after casting. This works because a cast that fails returns NULL rather than raising an error (Spark's non-ANSI cast behavior).

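import org.apache.spark.sql.functions.when
import spark.implicits._  // enables the $"col" syntax; `spark` is the active SparkSession

df
  // Values that cannot be parsed as integers become NULL after the cast
  .withColumn("class_integer", $"class".cast("integer"))
  .withColumn(
    "class_corrupted",
    // Keep the original string only when the cast turned a non-NULL value into NULL
    when($"class".isNotNull and $"class_integer".isNull, $"class"))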

Repeat for each column and cast you need.
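To avoid writing the check by hand for every column, it can be folded over a list of column-to-type pairs. The following is only a minimal sketch, not part of the original answer: the helper name `castWithValidation`, the `_cast` and `_corrupted` column suffixes, and the sample data are hypothetical. It also assumes that blank strings (such as the class value in row 5) should count as missing rather than corrupt, that surrounding spaces should be trimmed before casting, and that ANSI mode is off so a failed cast returns NULL.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, trim, when}

object CastWithValidation {
  // Hypothetical helper: for every (column -> target type) pair, add
  // `<name>_cast` holding the cast value and `<name>_corrupted` holding
  // the original string whenever the cast failed.
  def castWithValidation(df: DataFrame, targetTypes: Seq[(String, String)]): DataFrame =
    targetTypes.foldLeft(df) { case (acc, (name, dataType)) =>
      // Assumption: blank strings are treated as missing (NULL), and
      // surrounding spaces are trimmed before the cast.
      val raw = when(trim(col(name)) =!= "", trim(col(name)))
      val casted = raw.cast(dataType)
      acc
        .withColumn(s"${name}_cast", casted)
        // Corrupt = there was a value before the cast, but NULL after it.
        .withColumn(s"${name}_corrupted", when(raw.isNotNull && casted.isNull, col(name)))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cast-validation")
      // Assumption: with ANSI mode enabled, an invalid cast raises an error
      // instead of returning NULL, so it is switched off for this sketch.
      .config("spark.sql.ansi.enabled", "false")
      .getOrCreate()
    import spark.implicits._

    // Sample data modelled on the question's table; the blank class in
    // row 5 is represented here as an empty string.
    val df = Seq(
      ("1", "abc", "21"),
      ("2", "bca", "32"),
      ("3", "abab", " 4"),
      ("4", "baba", "5a"),
      ("5", "cccca", "")
    ).toDF("id", "name", "class")

    castWithValidation(df, Seq("id" -> "integer", "class" -> "integer")).show(false)
    spark.stop()
  }
}
```

Under these assumptions, only row 4 should carry a value (`5a`) in `class_corrupted`; the blank class in row 5 stays NULL in both new columns. If blanks should instead be reported as corrupt, drop the `trim(...) =!= ""` guard and compare against the original column directly.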
