I am using Spark 2.1 and have a DataFrame column containing values like AB|12|XY|4
. I want to create a new column with the last element removed, so it should look like AB|12|XY
.
I tried split and rsplit, but neither worked, so I need some suggestions for getting the desired output.
Use the Spark SQL split
function as follows:
>>> from pyspark.sql.functions import split
>>> json_data = ['{"c1":"AB|12|XY|4"}','{"c1":"11|22|33|44|remove"}']
>>> df = spark.read.json(sc.parallelize(json_data))
>>> df.show()
+------------------+
| c1|
+------------------+
| AB|12|XY|4|
|11|22|33|44|remove|
+------------------+
>>> df2 = df.withColumn("c2", split(df.c1, r'\|\w+$')[0])  # split takes a regex pattern; use a raw string so the backslash survives
>>> df2.show()
+------------------+-----------+
| c1| c2|
+------------------+-----------+
| AB|12|XY|4| AB|12|XY|
|11|22|33|44|remove|11|22|33|44|
+------------------+-----------+
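The pattern `\|\w+$` matches the final pipe plus the trailing token, so everything before it ends up in element `[0]` of the split result. Spark's `split` uses Java regexes, but for this simple pattern Python's `re` module behaves the same, so you can sanity-check the pattern locally without a Spark session:

```python
import re

# Same pattern passed to split(df.c1, r'\|\w+$'):
# an escaped pipe followed by one or more word characters at end of string.
pattern = r'\|\w+$'

# re.split removes the matched tail; the leading piece is element [0].
print(re.split(pattern, 'AB|12|XY|4')[0])          # AB|12|XY
print(re.split(pattern, '11|22|33|44|remove')[0])  # 11|22|33|44
```

Note that `\w+` only matches word characters; if the last element could contain other characters (spaces, punctuation), a pattern like `\|[^|]+$` would be a safer choice.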
If you need to do something more complicated that can't be implemented using the built-in functions, you can define your own user-defined function (UDF):
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import *
>>> def my_func(s):  # avoid naming the parameter "str", which shadows the builtin
...     return s.rsplit('|', 1)[0]
...
>>> my_udf = udf(my_func, StringType())
>>> json_data = ['{"c1":"AB|12|XY|4"}','{"c1":"11|22|33|44|remove"}']
>>> df = spark.read.json(sc.parallelize(json_data))
>>> df2 = df.withColumn("c2", my_udf(df.c1))
>>> df2.show()
+------------------+-----------+
| c1| c2|
+------------------+-----------+
| AB|12|XY|4| AB|12|XY|
|11|22|33|44|remove|11|22|33|44|
+------------------+-----------+
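Because the UDF body is plain Python, you can verify its logic locally before wrapping it with `udf`. `rsplit('|', 1)` splits exactly once from the right, so index `[0]` holds everything before the last delimiter:

```python
def my_func(s):
    # Split once from the right on '|' and keep everything before the last element.
    return s.rsplit('|', 1)[0]

print(my_func('AB|12|XY|4'))          # AB|12|XY
print(my_func('11|22|33|44|remove'))  # 11|22|33|44
```

One caveat: if a value contains no `|` at all, `rsplit` returns the whole string unchanged, and a `None` value would raise an error inside the UDF, so guard for nulls if your column can contain them.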
Built-in SQL functions are preferred because your data does not get passed back and forth between the JVM process and the Python process, which is what happens when you use a UDF.