Pyspark: Implement lambda function and udf from Python to Pyspark
I have a dataframe to which I am applying a lambda function that copies row values based on the value of a column.
In Pandas, it looks like this:
import pandas as pd
from numpy import nan

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': ['one', 'two', 'three', 'five']})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': ['five', 'six', nan, nan]})
# Left join on the key columns; the clashing 'value' columns get _x/_y suffixes
new_df = df1.merge(df2, how='left', left_on='lkey', right_on='rkey')
  lkey value_x rkey value_y
0  foo     one  foo    five
1  foo     one  foo     NaN
2  bar     two  bar     six
3  baz   three  baz     NaN
4  foo    five  foo    five
5  foo    five  foo     NaN
def my_func(row):
    # Overwrite value_x with value_y whenever value_y is not missing
    if pd.notna(row['value_y']):
        row['value_x'] = row['value_y']
    return row

applied_df = new_df.apply(lambda x: my_func(x), axis=1)
  lkey value_x rkey value_y
0  foo    five  foo    five
1  foo     one  foo     NaN
2  bar     six  bar     six
3  baz   three  baz     NaN
4  foo    five  foo    five
5  foo    five  foo     NaN
How would I do something similar in Pyspark?
Try this:
from pyspark.sql import functions as F

# Here df1 and df2 are assumed to be Spark DataFrames with the same columns as above
df1.withColumnRenamed("value", "value_x")\
    .join(df2.withColumnRenamed("value", "value_y"), F.col("lkey") == F.col("rkey"), 'left')\
    .withColumn("value_x", F.when(F.col("value_y").isNotNull(), F.col("value_y")).otherwise(F.col("value_x")))\
    .show()
#+----+-------+----+-------+
#|lkey|value_x|rkey|value_y|
#+----+-------+----+-------+
#| bar|    six| bar|    six|
#| foo|   five| foo|   five|
#| foo|    one| foo|   null|
#| foo|   five| foo|   five|
#| foo|   five| foo|   null|
#| baz|  three| baz|   null|
#+----+-------+----+-------+
#+----+-------+----+-------+