Extract the substring before the first occurrence and the substring after the last occurrence of a delimiter in PySpark
I have a data frame in PySpark like the one below:
df = spark.createDataFrame(
[
('14_100_00','A',25),
('13_100_00','B',24),
('15_100_00','A',20),
('150_100','C',21),
('16','A',20),
('1634_100_00_01','B',22),
('1_100_00','C',23),
('18_100_00','D',24)],("rust", "name", "value"))
df.show()
+--------------+----+-----+
|          rust|name|value|
+--------------+----+-----+
|     14_100_00|   A|   25|
|     13_100_00|   B|   24|
|     15_100_00|   A|   20|
|       150_100|   C|   21|
|            16|   A|   20|
|1634_100_00_01|   B|   22|
|      1_100_00|   C|   23|
|     18_100_00|   D|   24|
+--------------+----+-----+
I am trying to create a new column from the rust column using the following conditions:
1) Extract everything before the first underscore
2) Extract everything after the last underscore
3) Concatenate the two values with a tilde (~)
If the column contains no underscore, keep its value as is
I tried the following:
from pyspark.sql import functions as f  # the code below refers to the functions module as f
df1 = df.select("*", f.concat(f.substring_index(df.rust, '_', 1), f.lit('~'), f.substring_index(df.rust, '_', -1)).alias("extract"))
df1.show()
+--------------+----+-----+-------+
|          rust|name|value|extract|
+--------------+----+-----+-------+
|     14_100_00|   A|   25|  14~00|
|     13_100_00|   B|   24|  13~00|
|     15_100_00|   A|   20|  15~00|
|       150_100|   C|   21|150~100|
|            16|   A|   20|  16~16|
|1634_100_00_01|   B|   22|1634~01|
|      1_100_00|   C|   23|   1~00|
|     18_100_00|   D|   24|  18~00|
+--------------+----+-----+-------+
Expected result:
+--------------+----+-----+-------+
|          rust|name|value|extract|
+--------------+----+-----+-------+
|     14_100_00|   A|   25|  14~00|
|     13_100_00|   B|   24|  13~00|
|     15_100_00|   A|   20|  15~00|
|       150_100|   C|   21|150~100|
|            16|   A|   20|     16|
|1634_100_00_01|   B|   22|1634~01|
|      1_100_00|   C|   23|   1~00|
|     18_100_00|   D|   24|  18~00|
+--------------+----+-----+-------+
How can I achieve what I want?
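Use the instr function to check whether the rust column contains an underscore, then use the when function to handle the two cases: concatenate the two pieces when an underscore is present, otherwise keep the original value.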
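from pyspark.sql import functions as f

df1 = df.select(
    "*",
    # instr returns the 1-based position of '_' in rust, or 0 if it is absent
    f.when(
        f.instr(df.rust, '_') > 0,
        f.concat(f.substring_index(df.rust, '_', 1), f.lit('~'), f.substring_index(df.rust, '_', -1))
    )
    # no underscore: keep the original value unchanged
    .otherwise(df.rust)
    .alias("extract")
)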