Remove last pipe-delimited value from a DataFrame column in PySpark
I am using Spark 2.1 and have a DataFrame column containing values like AB|12|XY|4. I want to create a new column by removing the last element, so it should show AB|12|XY.
I tried split and rsplit, but neither worked, so I need some suggestions for getting the desired output.
Use the Spark SQL split function as follows:
>>> from pyspark.sql.functions import split
>>> json_data = ['{"c1":"AB|12|XY|4"}','{"c1":"11|22|33|44|remove"}']
>>> df = spark.read.json(sc.parallelize(json_data))
>>> df.show()
+------------------+
|                c1|
+------------------+
|        AB|12|XY|4|
|11|22|33|44|remove|
+------------------+
>>> df2 = df.withColumn("c2", split(df.c1, r'\|\w+$')[0])  # split takes a regex pattern; the raw string keeps the backslashes literal
>>> df2.show()
+------------------+-----------+
|                c1|         c2|
+------------------+-----------+
|        AB|12|XY|4|   AB|12|XY|
|11|22|33|44|remove|11|22|33|44|
+------------------+-----------+
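One caveat: the pattern \|\w+$ only matches when the last element consists of word characters (letters, digits, underscore). The sketch below uses Python's re module, without Spark, to compare it against a broader pattern, \|[^|]*$, which is an assumption of this example and not part of the original answer. The helper name drop_last and the extra sample values are likewise hypothetical. For these inputs Spark's Java regex engine should treat both patterns the same way.

```python
import re

# Pattern from the answer above: a pipe followed by word characters at the end.
WORD_TAIL = r'\|\w+$'
# Assumed alternative: a pipe followed by any run of non-pipe characters at the end.
ANY_TAIL = r'\|[^|]*$'

def drop_last(value, pattern):
    """Remove the trailing '|field' matched by pattern, if there is one."""
    return re.sub(pattern, '', value)

# The last two samples are made up to exercise the edge cases.
samples = ['AB|12|XY|4', '11|22|33|44|remove', 'AB|12|XY|4.5', 'AB|12|XY|']
for s in samples:
    print(s, '->', drop_last(s, WORD_TAIL), '/', drop_last(s, ANY_TAIL))

# In Spark, the same pattern could be passed to the built-in regexp_replace:
#   from pyspark.sql.functions import regexp_replace
#   df2 = df.withColumn("c2", regexp_replace(df.c1, r'\|[^|]*$', ''))
```

If the last element can contain characters such as a dot or a hyphen, or can be empty, the broader pattern is probably the safer choice. regexp_replace is also a built-in function, so it stays on the JVM side.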
If you need to do something more complicated that can't be implemented with the built-in functions, you can define your own user-defined function (UDF):
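>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType
>>> def my_func(s):
...     # Drop everything after the last '|'; pass nulls (None) through unchanged
...     return s.rsplit('|', 1)[0] if s is not None else None
...
>>> my_udf = udf(my_func, StringType())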
>>> json_data = ['{"c1":"AB|12|XY|4"}','{"c1":"11|22|33|44|remove"}']
>>> df = spark.read.json(sc.parallelize(json_data))
>>> df2 = df.withColumn("c2", my_udf(df.c1))
>>> df2.show()
+------------------+-----------+
|                c1|         c2|
+------------------+-----------+
|        AB|12|XY|4|   AB|12|XY|
|11|22|33|44|remove|11|22|33|44|
+------------------+-----------+
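Since the body of a UDF is an ordinary Python function, its logic can be checked without a Spark session before it is wrapped. The following is a minimal sketch; the name drop_last_field and its sep parameter are hypothetical, and the None check assumes that null column values reach a Python UDF as None.

```python
def drop_last_field(value, sep='|'):
    """Return value without its last sep-delimited field.

    None is passed through unchanged, on the assumption that Spark
    hands null column values to a Python UDF as None.
    """
    if value is None:
        return None
    # rsplit with maxsplit=1 cuts only at the right-most separator.
    return value.rsplit(sep, 1)[0]

print(drop_last_field('AB|12|XY|4'))
print(drop_last_field('no-delimiter'))  # no separator: returned unchanged
print(drop_last_field(None))

# Wrapping it for Spark would then look like:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   my_udf = udf(drop_last_field, StringType())
```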
Built-in SQL functions are preferred because your data does not get passed back and forth between the JVM process and the Python process, which is what happens when you use a UDF.