
PySpark - How to loop through the dataframe and match against another common value in another dataframe

This is a recommender system. I have a dataframe which contains about 10 recommended items for each user ( recommendation_df ), and another dataframe which consists of the recent purchases of each user ( recent_df ).

I am trying to code this task, but I can't seem to get the syntax and the dataframe manipulation right.

I am implementing a hit/miss ratio: for every new_party_id in recent_df , if any of its merch_store_code values matches a merch_store_code for the same party_id in recommendation_df , then count += 1 (hit).

Then the hit/miss ratio is count / total user count.

(However, in recent_df each user might have multiple recent purchases; if any of those purchases appears in the recommendation list for the same user, take it as a single hit ( count += 1 ).)
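To make the required logic concrete, here is a plain-Python sketch (not Spark code; the IDs and store codes are made-up toy values) of the hit/miss definition described above:

```python
# party_id -> set of recommended merch_store_codes (toy data)
recommendations = {
    "G18B00332C": {900000166, 168339566, 168993309},
    "K18M926299": {168350313, 900000072, 700012303},
}
# (new_party_id, merch_store_code) recent purchases (toy data)
recent = [
    ("G18B00332C", 168339566),  # in this user's recommendation list -> hit
    ("G18B00332C", 111111111),  # same user, second purchase: still only one hit
    ("K18M926299", 222222222),  # no purchase matches -> miss
]

users = {u for u, _ in recent}
# A user is a "hit" if ANY of their recent purchases is in their recommendation list.
hits = sum(
    1
    for u in users
    if any(code in recommendations.get(u, set())
           for uu, code in recent if uu == u)
)
hit_ratio = hits / len(users)  # 1 hit out of 2 users -> 0.5
```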

recommendation_df :

+--------------+----------------+-----------+----------+
|party_id_index|merch_store_code|     rating|  party_id|
+--------------+----------------+-----------+----------+
|           148|       900000166|  0.4021678|G18B00332C|
|           148|       168339566| 0.27687865|G18B00332C|
|           148|       168993309| 0.15999989|G18B00332C|
|           148|       168350313|  0.1431974|G18B00332C|
|           148|       168329726| 0.13634883|G18B00332C|
|           148|       168351967|0.120235085|G18B00332C|
|           148|       168993312| 0.11800903|G18B00332C|
|           148|       168337234|0.116267696|G18B00332C|
|           148|       168993256| 0.10836013|G18B00332C|
|           148|       168339482| 0.10341005|G18B00332C|
|           463|       168350313| 0.93455887|K18M926299|
|           463|       900000072|  0.8275664|K18M926299|
|           463|       700012303| 0.70220494|K18M926299|
|           463|       700012180| 0.23209469|K18M926299|
|           463|       900000157|  0.1727839|K18M926299|
|           463|       700013689| 0.13854747|K18M926299|
|           463|       900000166| 0.12866624|K18M926299|
|           463|       168993284|0.107065596|K18M926299|
|           463|       168993269| 0.10272527|K18M926299|
|           463|       168339566| 0.10256036|K18M926299|
+--------------+----------------+-----------+----------+

recent_df :

+------------+---------------+----------------+
|new_party_id|recent_purchase|merch_store_code|
+------------+---------------+----------------+
|  A11275842R|     2022-05-21|       168289403|
|  A131584211|     2022-06-01|       168993311|
|  A131584211|     2022-06-01|       168349493|
|  A131584211|     2022-06-01|       168350192|
|  A182P3539K|     2022-03-26|       168341707|
|  A182V2883F|     2022-05-26|       168350824|
|  A183B5482P|     2022-05-10|       168993464|
|  A183C6900K|     2022-05-14|       168338795|
|  A183D56093|     2022-05-20|       700012303|
|  A183J5388G|     2022-03-18|       700013650|
|  A183U8880P|     2022-04-01|       900000072|
|  A183U8880P|     2022-04-01|       168991904|
|  A18409762L|     2022-05-10|       168319352|
|  A18431276J|     2022-05-14|       168163905|
|  A18433684M|     2022-03-21|       168993324|
|  A18433978F|     2022-05-20|       168341876|
|  A184410389|     2022-05-04|       900000166|
|  A184716280|     2022-04-06|       700013653|
|  A18473797O|     2022-05-24|       168330339|
|  A18473797O|     2022-05-24|       168350592|
+------------+---------------+----------------+

Here is my current coding logic:

count = 0
def hitratio(recommendation_df, recent_df):
    for i in recent_df['new_party_id']:
        for j in recommendation_df['party_id']:
            if (i == j) and (i.merch_store_code == j.merch_store_code):
                count += 1
    return count / recent_df.count()

Assumption: I am taking the total row count of recent_df as the denominator for calculating the hit/miss ratio; you can change the formula.

from pyspark.sql import functions as F

matching_cond = (recent_df["merch_store_code"] == recommendation_df["merch_store_code"]) \
    & recommendation_df["party_id"].isNotNull()

recent_fnl = (recent_df
    .join(recommendation_df, recent_df["new_party_id"] == recommendation_df["party_id"], "left")
    .withColumn("hit", F.when(matching_cond, F.lit(True)).otherwise(F.lit(False))))

hit_ratio = recent_fnl.filter(F.col("hit")).count() / recent_df.count()
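Note that this approach divides by the row count, so the ratio is row-level (one row per recent purchase). The question asks for a per-user ratio, where any matching purchase counts once per user. A plain-Python illustration of the difference, using hypothetical toy data:

```python
# (new_party_id, hit) rows, as they would look after the left join (toy data)
rows = [
    ("A1", True), ("A1", False), ("A1", False),  # user A1: 1 of 3 rows is a hit
    ("B2", False),                               # user B2: no hits
]

# Row-level ratio: hit rows over all rows.
row_ratio = sum(h for _, h in rows) / len(rows)      # 1/4 = 0.25

# User-level ratio: collapse to one flag per user first
# (the equivalent of grouping by user and taking the max of "hit").
by_user = {}
for user, hit in rows:
    by_user[user] = by_user.get(user, False) or hit
user_ratio = sum(by_user.values()) / len(by_user)    # 1/2 = 0.5
```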

Do let me know if you have any questions about this.

If you like my solution, you can upvote.

In Spark, refrain from looping over rows. Spark does not work like that; you need to think in terms of whole columns, not a row-by-row scenario.

You need to join the two tables on both the user id and the store code, then select the matched users without duplicates (distinct).

df_distinct_matches = (
    recent_df
    .join(
        recommendation_df,
        (recent_df["new_party_id"] == recommendation_df["party_id"])
        & (recent_df["merch_store_code"] == recommendation_df["merch_store_code"]),
    )
    .select("party_id")
    .distinct()
)
hit = df_distinct_matches.count()
