Get last / delimited value from DataFrame column in PySpark
I am trying to get the last string after '/'.
The column can look like this: "lala/mae.da/rg1/zzzzz" (not necessarily only 3 /), and I'd like to return: zzzzz
In SQL and Python it's very easy, but I would like to know if there is a way to do it in PySpark.
Solving it in Python:
original_string = "lala/mae.da/rg1/zzzzz"
last_char_index = original_string.rfind("/")
new_string = original_string[last_char_index+1:]
or directly:
new_string = original_string.rsplit('/', 1)[1]
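One caveat worth noting (not raised in the original question): `rsplit('/', 1)[1]` raises an `IndexError` when the string contains no '/' at all, whereas indexing with `[-1]` handles both cases. A minimal sketch:

```python
def last_segment(s: str) -> str:
    # rsplit returns a one-element list when there is no '/',
    # so [-1] is safe whether or not the delimiter is present
    return s.rsplit('/', 1)[-1]

print(last_segment("lala/mae.da/rg1/zzzzz"))  # zzzzz
print(last_segment("fefe"))                   # fefe
```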
And in SQL:
RIGHT(MyColumn, CHARINDEX('/', REVERSE(MyColumn))-1)
For PySpark I was thinking something like this:
df = df.select(col("MyColumn").rsplit('/', 1)[1])
but I get the following error: TypeError: 'Column' object is not callable, and I am not even sure Spark allows me to do rsplit at all.
Do you have any suggestions on how I can solve this?
import pandas as pd
from pyspark.sql import functions as F
df = pd.DataFrame({"MyColumn": ["lala/mae.da/rg1/zzzzz", "fefe", "fe/fe/frs/fs/fe32/4"]})
df = spark.createDataFrame(df)
df.show(truncate=False)
# output
+---------------------+
|MyColumn |
+---------------------+
|lala/mae.da/rg1/zzzzz|
|fefe |
|fe/fe/frs/fs/fe32/4 |
+---------------------+
(
    df
    .withColumn("NewCol", F.split("MyColumn", "/"))
    .withColumn("NewCol", F.col("NewCol")[F.size("NewCol") - 1])
    .show()
)
# output
+--------------------+------+
| MyColumn|NewCol|
+--------------------+------+
|lala/mae.da/rg1/z...| zzzzz|
| fefe| fefe|
| fe/fe/frs/fs/fe32/4| 4|
+--------------------+------+
Since Spark 2.4, you can use the split built-in function to split your string, then use the element_at built-in function to get the last element of the resulting array, as follows:
from pyspark.sql import functions as F
df = df.select(F.element_at(F.split(F.col("MyColumn"), '/'), -1))
Adding another solution even though @Pav3k's answer is great. It also uses element_at, which gets the item at a specific position out of an array:
from pyspark.sql import functions as F
df = df.withColumn('my_col_split', F.split(df['MyColumn'], '/'))\
       .select('MyColumn', F.element_at(F.col('my_col_split'), -1).alias('rsplit'))
>>> df.show(truncate=False)
+---------------------+------+
|MyColumn |rsplit|
+---------------------+------+
|lala/mae.da/rg1/zzzzz|zzzzz |
|fefe |fefe |
|fe/fe/frs/fs/fe32/4 |4 |
+---------------------+------+
Pav3k's DF used.