
Match DataFrame column value against another DataFrame column and count hits

I have two Spark DataFrames: df1 contains addresses, and df2 contains street names, cities, regions, etc.

df1 = spark.createDataFrame([
  ["001", "Luc  Krier","2363  Ryan Road, Long Lake South Dakota","2363RyanRoad,LongLakeSouthDakota"],
  ["002", "Jeanny  Thorn","2263 Patton Lane Raleigh North Carolina","2263PattonLaneRaleighNorthCarolina"],
  ["003", "Teddy E Beecher","2839 Hartland Avenue Fond Du Lac Wisconsin","2839HartlandAvenueFondDuLacWisconsin"],
  ["004", "Philippe  Schauss","1 Im Oberdorf Allemagne","1ImOberdorfAllemagne"],
  ["005", "Meindert I Tholen","Hagedoornweg 138 Amsterdam","Hagedoornweg138Amsterdam"]
]).toDF("id","name","address1", "address2")

df2 = spark.createDataFrame([
 ["US","Amsterdam"],
 ["US","SouthDakota"],
 ["LU","Allemagne"],
 ["FR","Allemagne"],
 ["NL","Amsterdam"],
 ["NL","Rotterdam"],
 ["US","Wisconsin"],
 ["AU","Wisconsin"],
 ["AU","Hartland"]
]).toDF("cc","point")

I want to check whether df1['address2'] contains any of the values in df2['point']. The expected result is a new column cc with values like the following (illustrative only; they do not correspond to the sample DataFrames above):

('US':1)
('US':2)('NL':1)
('US':3)('FR':1)('LU':1)
('NL':1)

The new column should hold the cc from df2['cc'] together with the number of matches. An address can match multiple values from df2. The entries should be sorted by number of matches, highest first.
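To pin down what that rule gives for the sample data itself, here is a small reference sketch in plain Python (no Spark involved). The helper name `expected_cc` is hypothetical, and breaking ties alphabetically by cc is an assumption, since the question only specifies "highest first".

```python
from collections import Counter

# address2 values from df1, keyed by id
addresses = {
    "001": "2363RyanRoad,LongLakeSouthDakota",
    "002": "2263PattonLaneRaleighNorthCarolina",
    "003": "2839HartlandAvenueFondDuLacWisconsin",
    "004": "1ImOberdorfAllemagne",
    "005": "Hagedoornweg138Amsterdam",
}

# (cc, point) rows from df2
points = [
    ("US", "Amsterdam"), ("US", "SouthDakota"), ("LU", "Allemagne"),
    ("FR", "Allemagne"), ("NL", "Amsterdam"), ("NL", "Rotterdam"),
    ("US", "Wisconsin"), ("AU", "Wisconsin"), ("AU", "Hartland"),
]


def expected_cc(address2, points):
    """Count, per country code, how many points occur as substrings of address2."""
    hits = Counter(cc for cc, point in points if point in address2)
    # Highest count first; ties broken alphabetically by cc (an assumption).
    ordered = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return "".join(f"('{cc}':{n})" for cc, n in ordered)


for id_, address2 in addresses.items():
    print(id_, expected_cc(address2, points))
```

With this data, id 003 matches Hartland and Wisconsin for AU and Wisconsin for US, so it yields ('AU':2)('US':1), while id 002 matches nothing and yields an empty string.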

You can perform a "conditional" join. But be aware that, as @Steven mentioned in his comment, Spark has to compare every row of df1 with every row of df2, which is effectively a cross join. Performance-wise this will not be your best option, but what you are trying to achieve is possible if you don't take performance into account.

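from pyspark.sql import functions as f

# Left join on a substring condition: each df1 row is paired with every df2 row whose point occurs in address2
df_join = df1.join(df2, df1.address2.contains(df2.point), how='left')
result = (
    df_join
    .groupBy('id','name','address1', 'cc').count()  # number of matches per address and country code
    .select('id', 'name', 'address1', f.concat(f.lit("('"), f.col("cc"), f.lit("':"), f.col("count"), f.lit(")")).alias('cc'))
    .groupBy('id','name','address1').agg(f.concat_ws("", f.collect_list(f.col("cc"))).alias('cc'))  # null labels (no match) are skipped
)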

It may help to broadcast df2 (the smaller of the two DataFrames).

PySpark and broadcast join example

