PySpark: Match DataFrame column value against another DataFrame column and count hits
I have two Spark DataFrames. df1 contains addresses, and df2 contains street names, cities, regions, and so on.
df1 = spark.createDataFrame([
["001", "Luc Krier","2363 Ryan Road, Long Lake South Dakota","2363RyanRoad,LongLakeSouthDakota"],
["002", "Jeanny Thorn","2263 Patton Lane Raleigh North Carolina","2263PattonLaneRaleighNorthCarolina"],
["003", "Teddy E Beecher","2839 Hartland Avenue Fond Du Lac Wisconsin","2839HartlandAvenueFondDuLacWisconsin"],
["004", "Philippe Schauss","1 Im Oberdorf Allemagne","1ImOberdorfAllemagne"],
["005", "Meindert I Tholen","Hagedoornweg 138 Amsterdam","Hagedoornweg138Amsterdam"]
]).toDF("id","name","address1", "address2")
df2 = spark.createDataFrame([
["US","Amsterdam"],
["US","SouthDakota"],
["LU","Allemagne"],
["FR","Allemagne"],
["NL","Amsterdam"],
["NL","Rotterdam"],
["US","Wisconsin"],
["AU","Wisconsin"],
["AU","Hartland"]
]).toDF("cc","point")
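I want to check whether df1['address2'] contains any of the values from df2['point']. The expected result is a new column cc with values like the following (made up, they do not correspond to the example DataFrames):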
('US':1)
('US':2)('NL':1)
('US':3)('FR':1)('LU':1)
('NL':1)
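That is, return the cc from df2['cc'] together with the number of matches. One address can hit several values from df2. The entries should be sorted by number of matches (highest first).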
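To pin down what is being asked for, here is a small plain-Python sketch of the matching and counting for a single address, without Spark. The function name `expected_cc` is made up for illustration, and the tie-breaking order (reverse alphabetical by cc when counts are equal) is an assumption, since the question does not specify one.

```python
from collections import Counter


def expected_cc(address, points):
    """Hypothetical reference: count, per cc, how many points occur in the address."""
    # A point "hits" when it is a substring of the address.
    counts = Counter(cc for cc, point in points if point in address)
    # Highest count first; ties fall back to reverse alphabetical cc (an assumption).
    ordered = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
    return "".join(f"('{cc}':{n})" for cc, n in ordered)


# A subset of the df2 rows from the question, as (cc, point) pairs.
points = [("US", "Wisconsin"), ("AU", "Wisconsin"), ("AU", "Hartland")]
print(expected_cc("2839HartlandAvenueFondDuLacWisconsin", points))
```

For address 003 this counts two hits for AU (Wisconsin and Hartland) and one for US, so AU is listed first.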
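You can do a "conditional" join. Note that, as @Steven mentioned in his comment, this results in a cross join, because a contains() condition is not an equality the engine can join on. Performance-wise this will not be your best option, but if performance is not a concern, what you are trying to achieve is possible.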
import pyspark.sql.functions as f

# Non-equi join: every df1 row is compared with every df2 row.
df_join = df1.join(df2, df1.address2.contains(df2.point), how='left')
result = (
    df_join
    .groupBy('id', 'name', 'address1', 'cc').count()
    .select('id', 'name', 'address1', f.concat(f.lit("'"), f.col("cc"), f.lit("':"), f.col("count")).alias('cc'))
    .groupBy('id', 'name', 'address1').agg(f.concat_ws("", f.collect_list(f.col("cc"))).alias('cc'))
)
What may help is broadcasting df2 (the smaller of the two).