简体   繁体   中英

split column on the last occurrence of a delimiter into two columns based on column condition in spark dataframe

I have the following spark dataframe

+------+--------------------+-----------------+--------------------+
|entity|            instance|             name|               value|
+------+--------------------+-----------------+--------------------+
|Column|            ARM_PRED|  Histogram.abs. |            638065.0|
|Column|            ARM_PRED|Histogram.ratio. |                 1.0|
|Column|       ATTR044_Count|       Compliance|  0.9886814039322013|
|Column|ATTR090_MissingCount|       Compliance|  0.9997570780406385|
|Column|             ATTR045|             Mean|  17101.922291929834|
|Column|             ATTR045|          Maximum|           4000000.0|
|Column|             ATTR045|          Minimum|                 0.0|
|Column|             ATTR045|StandardDeviation|   32981.67451994775|
|Column|       ATTR020_Count|       Compliance|    0.99814909139351|
|Column|ATTR036_MissingCount|       Compliance|0.011260608245241473|
|Column|       ATTR011_Count|       Compliance|    0.99814909139351|
|Column|             ATTR069|             Mean|  0.8678189434810735|
|Column|             ATTR069|          Maximum|               103.0|

When name column has Compliance value. I have to split instance column based on last occurrence of delimiter "_" and place values in instance column and name column.

Expected output:

+------+--------------------+-----------------+--------------------+
|entity|            instance|             name|               value|
+------+--------------------+-----------------+--------------------+
|Column|            ARM_PRED|  Histogram.abs. |            638065.0|
|Column|            ARM_PRED|Histogram.ratio. |                 1.0|
|Column|             ATTR044|            Count|  0.9886814039322013|
|Column|             ATTR090|     MissingCount|  0.9997570780406385|
|Column|             ATTR045|             Mean|  17101.922291929834|
|Column|             ATTR045|          Maximum|           4000000.0|
|Column|             ATTR045|          Minimum|                 0.0|
|Column|             ATTR045|StandardDeviation|   32981.67451994775|
|Column|             ATTR020|            Count|    0.99814909139351|
|Column|             ATTR036|     MissingCount|0.011260608245241473|
|Column|             ATTR011|            Count|    0.99814909139351|
|Column|             ATTR069|             Mean|  0.8678189434810735|
|Column|             ATTR069|          Maximum|               103.0|

Below code performs operation on all records, but i need to perform only when name column has Compliance value

df.withColumn("instance", split($"instance", "_(?!.*_)"))
   .withColumn("instance", $"instance"(0))
   .withColumn("name", $"instance"(1))).show

Need help!

You can use when function .

Example:

df
 .withColumn("_instance", split($"instance", "_(?!.*_)"))
 .withColumn("instance", when($"name" === "Compliance", $"_instance"(0)).otherwise($"instance"))
 .withColumn("name", when($"name" === "Compliance", $"_instance"(1)).otherwise($"name"))
 .drop("_instance")

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM