Pyspark dataframe : remove cumulative pairs from pyspark dataframe
I want to remove pairs that have the same id, keeping only one of each pair in the dataframe.
I also cannot simply drop duplicates by 'id', because the same 'id' can have several combinations that are not cumulative pairs. I tried the following approach in Python, but I am not sure how to do the same in PySpark; any help is appreciated.
m_f_1['value'] = m_f_1.apply(
    lambda x: str(x['value_x']) + str(x['value_y'])
              if x['value_x'] > x['value_y']
              else str(x['value_y']) + str(x['value_x']),
    axis=1)
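(For reference, the pandas version can be finished by dropping duplicates on that key; a minimal sketch, assuming m_f_1 is a pandas DataFrame with the 'value' column built as above:

# Hedged sketch: keep one row per (id, value) pair, then drop the helper key.
m_f_1 = m_f_1.drop_duplicates(subset=['id', 'value']).drop(columns=['value'])
)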
The input dataframe m_f_1 is:
id value.x value.y
100057 38953993985 38993095846
100057 38993095845 38953993985
100057 38993095845 38993095846
100057 38993095846 38953993985
100011 38989281716 38996868028
100011 38996868028 38989281716
100019 38916115350 38994231881
100019 38994231881 38916115350
The output should be
head(res)
id value.x value.y
100011 38989281716 38996868028
100019 38916115350 38994231881
100031 38911588267 38993358322
100057 38953993985 38993095846
100057 38993095845 38953993985
100057 38993095845 38993095846
You can achieve this with pyspark.sql.functions: pyspark.sql.functions.greatest and pyspark.sql.functions.least take the larger and the smaller of the two values respectively, and pyspark.sql.functions.concat concatenates the strings.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()

data = [(100057, 38953993985, 38993095846),
        (100057, 38993095845, 38953993985),
        (100057, 38993095845, 38993095846),
        (100057, 38993095846, 38953993985),
        (100011, 38989281716, 38996868028),
        (100011, 38996868028, 38989281716),
        (100019, 38916115350, 38994231881),
        (100019, 38994231881, 38916115350)]
m_f_1 = spark.createDataFrame(data, schema=['id', 'value_x', 'value_y'])

# Build an order-independent key: the larger value concatenated with the
# smaller one, so that (x, y) and (y, x) produce the same key.
m_f_1 = m_f_1.withColumn('value', F.concat(F.greatest('value_x', 'value_y').cast('string'),
                                           F.least('value_x', 'value_y').cast('string')))
# Keep one row per key, drop the helper column, and sort by id for display.
m_f_1 = m_f_1.dropDuplicates(subset=['value']).drop('value').sort('id')
m_f_1.show(truncate=False)
+------+-----------+-----------+
|id |value_x |value_y |
+------+-----------+-----------+
|100011|38989281716|38996868028|
|100019|38916115350|38994231881|
|100057|38993095845|38953993985|
|100057|38953993985|38993095846|
|100057|38993095845|38993095846|
+------+-----------+-----------+
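Plain string concatenation can be ambiguous in general (for example '123' + '45' and '1234' + '5' both give '12345'), so a variant of the same idea is to deduplicate on an array key instead; a sketch under the same m_f_1 and imports as above, here also keyed by id so that identical pairs under different ids are kept:

# Arrays compare element-wise, so (least, greatest) is an unambiguous key.
key = F.array(F.least('value_x', 'value_y'), F.greatest('value_x', 'value_y'))
dedup = (m_f_1.withColumn('key', key)
              .dropDuplicates(subset=['id', 'key'])
              .drop('key')
              .sort('id'))
dedup.show(truncate=False)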
This should also work when you want uniqueness across more than two columns.
df = spark.createDataFrame([(100057, 38953993985, 38993095846),
                            (100057, 38993095845, 38953993985),
                            (100057, 38993095845, 38993095846),
                            (100057, 38993095846, 38953993985),
                            (100011, 38989281716, 38996868028),
                            (100011, 38996868028, 38989281716),
                            (100019, 38916115350, 38994231881),
                            (100019, 38994231881, 38916115350)],
                           ['id', 'value_x', 'value_y'])
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, LongType

# Return the two values in sorted order so (x, y) and (y, x) map to the same pair.
def list_sort(x, y):
    return sorted([x, y])

# The values exceed the 32-bit integer range, so the element type must be
# LongType rather than IntegerType; ArrayType also needs to be imported.
udf_list_sort = udf(list_sort, ArrayType(LongType()))
spark.udf.register("udf_list_sort", udf_list_sort)

df1 = df.selectExpr("id", "udf_list_sort(value_x, value_y) AS value_x_y").distinct()
df1.selectExpr("id AS id",
               "value_x_y[0] AS value_x",
               "value_x_y[1] AS value_y").show()
#+------+-----------+-----------+
#|    id|    value_x|    value_y|
#+------+-----------+-----------+
#|100019|38916115350|38994231881|
#|100011|38989281716|38996868028|
#|100057|38953993985|38993095846|
#|100057|38953993985|38993095845|
#|100057|38993095845|38993095846|
#+------+-----------+-----------+
# (row order after distinct() is not guaranteed)
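On Spark 2.4 and later, the Python UDF can be avoided entirely with the built-in pyspark.sql.functions.array_sort, which is usually faster because the sorting stays in the JVM; a minimal sketch assuming the same df as above:

import pyspark.sql.functions as F

# Sort the pair natively (no Python round-trip), then deduplicate.
df1 = (df.select('id', F.array_sort(F.array('value_x', 'value_y')).alias('value_x_y'))
         .distinct())
df1.select('id',
           F.col('value_x_y')[0].alias('value_x'),
           F.col('value_x_y')[1].alias('value_y')).show()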