
remove last pipe-delimited value from dataframe column in pyspark

I am using Spark 2.1 and have a dataframe column containing values like AB|12|XY|4 . I want to create a new column by removing the last element, so it should show AB|12|XY .

I tried split, and rsplit did not work, so I need some suggestions to get the desired output.

Use the Spark SQL split function as follows:

>>> from pyspark.sql.functions import split
>>> json_data = ['{"c1":"AB|12|XY|4"}','{"c1":"11|22|33|44|remove"}']
>>> df        = spark.read.json(sc.parallelize(json_data))
>>> df.show()
+------------------+                   
|                c1|                  
+------------------+                  
|        AB|12|XY|4|                  
|11|22|33|44|remove|                  
+------------------+                  

>>> df2 = df.withColumn("c2", split(df.c1, r'\|\w+$')[0])  # split takes a regex pattern; [0] keeps the part before the match
>>> df2.show()
+------------------+-----------+
|                c1|         c2|
+------------------+-----------+
|        AB|12|XY|4|   AB|12|XY|
|11|22|33|44|remove|11|22|33|44|
+------------------+-----------+ 

If you need to do something more complicated that can't be implemented using the built-in functions, you can define your own user-defined function (UDF):

>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import *
>>> def my_func(s):  # avoid shadowing the built-in name str
...   return s.rsplit('|', 1)[0]  # drop the last pipe-delimited element
...
>>> my_udf    = udf(my_func, StringType())
>>> json_data = ['{"c1":"AB|12|XY|4"}','{"c1":"11|22|33|44|remove"}']
>>> df        = spark.read.json(sc.parallelize(json_data))

>>> df2 = df.withColumn("c2", my_udf(df.c1))
>>> df2.show()
+------------------+-----------+ 
|                c1|         c2|
+------------------+-----------+
|        AB|12|XY|4|   AB|12|XY|
|11|22|33|44|remove|11|22|33|44|
+------------------+-----------+

Built-in SQL functions are preferred because your data does not get passed back and forth between the JVM process and the Python process, which is what happens when you use a UDF.
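
Another built-in option for this particular case is regexp_replace. This is a minimal sketch assuming the same df as above; the pattern \|[^|]*$ is my own choice here, and unlike \|\w+$ it also strips a last element that contains non-word characters:

>>> from pyspark.sql.functions import regexp_replace
>>> df3 = df.withColumn("c2", regexp_replace(df.c1, r'\|[^|]*$', ''))  # delete the final pipe and whatever follows it
>>> df3.show()  # c2 should match the split example above

Like split, regexp_replace runs entirely inside the JVM, so it keeps the same performance advantage over a UDF.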
