
Spark Java edit data in column

I would like to iterate through the contents of a column in a Spark DataFrame and correct the data within a cell if it meets a certain condition:

+------------+
|column_title|
+------------+
|null        |
|0           |
|1           |
+------------+

Let's say I want to display something else when the value of the column is null. I tried with:

Column.when() and Dataset.withColumn()

But I can't find the right method, and I don't think it should be necessary to convert to an RDD and iterate through it.

You can use when together with equalTo, or when together with isNull.

Dataset<Row> df1 = df.withColumn("value", when(col("value").equalTo("bbb"), "ccc").otherwise(col("value")));

Dataset<Row> df2 = df.withColumn("value", when(col("value").isNull(), "ccc").otherwise(col("value")));

If you only want to replace null values, you can also use na and fill.

Dataset<Row> df3 = df.na().fill("ccc");
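Putting the snippets above together, here is a minimal self-contained sketch built around the sample data from the question (the replacement value "ccc" and the local SparkSession setup are illustrative assumptions):

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.when;

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class NullFixExample {

    // Builds the sample column from the question and replaces nulls with "ccc"
    public static List<String> cleaned() {
        SparkSession spark = SparkSession.builder()
                .appName("null-fix-example")
                .master("local[1]")
                .getOrCreate();

        StructType schema = new StructType()
                .add("column_title", DataTypes.StringType, true);

        Dataset<Row> df = spark.createDataFrame(Arrays.asList(
                RowFactory.create((Object) null),
                RowFactory.create("0"),
                RowFactory.create("1")), schema);

        // when/isNull keeps non-null values and substitutes a default for nulls
        Dataset<Row> fixed = df.withColumn("column_title",
                when(col("column_title").isNull(), "ccc")
                        .otherwise(col("column_title")));

        List<String> result = fixed.collectAsList().stream()
                .map(r -> r.getString(0))
                .collect(Collectors.toList());
        spark.stop();
        return result;
    }

    public static void main(String[] args) {
        System.out.println(cleaned());
    }
}
```

With a single local partition the output preserves the input order, so the null row becomes "ccc" while "0" and "1" pass through unchanged.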

Another way of doing this is with a UDF.

Create a UDF:

    private static UDF1<String, String> myUdf = new UDF1<String, String>() {
        @Override
        public String call(final String str) throws Exception {
            // any condition or custom function can be used;
            // rightPad comes from org.apache.commons.lang3.StringUtils
            return StringUtils.rightPad(str, 25, 'A');
        }
    };

Register the UDF with the SparkSession:

    sparkSession.udf().register("myUdf", myUdf, DataTypes.StringType);

Apply the UDF to the dataset. Note that the name passed to callUDF must match the registered name:

   dataset = dataset.withColumn("city", functions.callUDF("myUdf", col("city")));
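For completeness, here is an end-to-end sketch of the UDF route. The class name, column name, and sample cities are illustrative; the padding logic mirrors the rightPad call above but uses only the JDK, so no commons-lang dependency is needed:

```java
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class UdfExample {

    public static List<String> padded() {
        SparkSession spark = SparkSession.builder()
                .appName("udf-example")
                .master("local[1]")
                .getOrCreate();

        StructType schema = new StructType()
                .add("city", DataTypes.StringType, true);
        Dataset<Row> df = spark.createDataFrame(Arrays.asList(
                RowFactory.create("Oslo"),
                RowFactory.create("Berlin")), schema);

        // Right-pad each value to 10 characters with 'A', passing nulls through
        UDF1<String, String> padUdf = s -> {
            if (s == null) {
                return null;
            }
            StringBuilder sb = new StringBuilder(s);
            while (sb.length() < 10) {
                sb.append('A');
            }
            return sb.toString();
        };
        spark.udf().register("padRight", padUdf, DataTypes.StringType);

        Dataset<Row> out = df.withColumn("city", callUDF("padRight", col("city")));

        List<String> result = out.collectAsList().stream()
                .map(r -> r.getString(0))
                .collect(Collectors.toList());
        spark.stop();
        return result;
    }
}
```

Since UDF1 extends Serializable, a lambda works here just as well as the anonymous class shown above; the registered name "padRight" is what callUDF must reference.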

Hope it helps!
