I am using Spark 2.1 and have a DataFrame column containing values like AB|12|XY|4
. I want to create a new column with the last element removed, so it should look like AB|12|XY
.
I tried split and rsplit, but neither worked, so I need some suggestions for getting the desired output.
Use the Spark SQL split
function as follows:
>>> from pyspark.sql.functions import split
>>> json_data = ['{"c1":"AB|12|XY|4"}','{"c1":"11|22|33|44|remove"}']
>>> df = spark.read.json(sc.parallelize(json_data))
>>> df.show()
+------------------+
| c1|
+------------------+
| AB|12|XY|4|
|11|22|33|44|remove|
+------------------+
>>> df2 = df.withColumn("c2", split(df.c1, r'\|\w+$')[0])  # split takes a regex pattern; use a raw string so the backslash survives
>>> df2.show()
+------------------+-----------+
| c1| c2|
+------------------+-----------+
| AB|12|XY|4| AB|12|XY|
|11|22|33|44|remove|11|22|33|44|
+------------------+-----------+
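The pattern `\|\w+$` matches the final pipe plus the trailing token, so everything before it ends up in element `[0]` of the split result. Spark's `split` uses Java regexes, but for this simple pattern Python's `re` module behaves the same, so you can sanity-check the pattern locally without a Spark session:

```python
import re

# Same pattern passed to split(df.c1, r'\|\w+$'):
# an escaped pipe followed by one or more word characters at end of string.
pattern = r'\|\w+$'

# re.split removes the matched tail; the leading piece is element [0].
print(re.split(pattern, 'AB|12|XY|4')[0])          # AB|12|XY
print(re.split(pattern, '11|22|33|44|remove')[0])  # 11|22|33|44
```

Note that `\w+` only matches word characters; if the last element could contain other characters (spaces, punctuation), a pattern like `\|[^|]+$` would be a safer choice.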
If you need to do something more complicated that can't be implemented using the built-in functions, you can define your own user-defined function (UDF):
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import *
>>> def my_func(s):  # avoid naming the parameter "str", which shadows the builtin
...     return s.rsplit('|', 1)[0]
...
>>> my_udf = udf(my_func, StringType())
>>> json_data = ['{"c1":"AB|12|XY|4"}','{"c1":"11|22|33|44|remove"}']
>>> df = spark.read.json(sc.parallelize(json_data))
>>> df2 = df.withColumn("c2", my_udf(df.c1))
>>> df2.show()
+------------------+-----------+
| c1| c2|
+------------------+-----------+
| AB|12|XY|4| AB|12|XY|
|11|22|33|44|remove|11|22|33|44|
+------------------+-----------+
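Because the UDF body is plain Python, you can verify its logic locally before wrapping it with `udf`. `rsplit('|', 1)` splits exactly once from the right, so index `[0]` holds everything before the last delimiter:

```python
def my_func(s):
    # Split once from the right on '|' and keep everything before the last element.
    return s.rsplit('|', 1)[0]

print(my_func('AB|12|XY|4'))          # AB|12|XY
print(my_func('11|22|33|44|remove'))  # 11|22|33|44
```

One caveat: if a value contains no `|` at all, `rsplit` returns the whole string unchanged, and a `None` value would raise an error inside the UDF, so guard for nulls if your column can contain them.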
Built-in SQL functions are preferred because your data does not get passed back and forth between the JVM process and the Python process, which is what happens when you use a UDF.