Joining two DataFrames using PySpark. I have a unique_id column and a non_unique_id column in separate DFs. How do I filter the non-unique column by the unique_id?

Question

There are two DataFrames. The first is named products_purchased, the second products_suggested.

The products_purchased table (alias pp) has a customer_id column (unique), an item_id column, and a purchased column (always 1).

The products_suggested table (alias ps) has a customer_id column (non-unique) and an item_id column. This table has more customer_ids than the products_purchased table, because not every customer who is suggested items goes on to purchase one.

I would like to join the two tables, retaining the purchased column where ps.customer_id (non-unique) and ps.item_id match pp.customer_id (unique) and pp.item_id. I would also like to keep any records where pp.customer_id (unique) matches ps.customer_id (non-unique).

The idea is to end up with a table containing only the records for customers who went on to purchase an item. The purchased item would be labeled 1 in the purchased column, and that customer's other suggested items would be labeled 0.

Product Suggested Table

|customer_id|item_id       |
+-----------+--------------+
|      16413|         51654|
|      16413|         75950|
|      16413|       1366117|
|      78450|         56107|
|      94038|         72358|
|      94038|       1451889|
|     113067|         75077|
|      89578|         53279|

Product Purchased Table

|customer_id|item_id       |purchased|
+-----------+--------------+---------+
|      16413|         75950|        1|
|      78450|         56107|        1|
|      94038|         72358|        1|

Final Table (desired)

|customer_id|item_id       |purchased|
+-----------+--------------+---------+
|      16413|         51654|        0|
|      16413|         75950|        1|
|      16413|       1366117|        0|
|      78450|         56107|        1|
|      94038|         72358|        1|
|      94038|       1451889|        0|

I tried a left join on customer_id and item_id. I got what I expected: a table with all the suggested items, regardless of whether the customer purchased anything, with the purchased status attached:

final = products_suggested.join(
    products_purchased, on=["customer_id", "item_id"], how="left"
)

Final Table

|customer_id|item_id       |purchased|
+-----------+--------------+---------+
|      16413|         51654|        0|
|      16413|         75950|        1|
|      16413|       1366117|        0|
|      78450|         56107|        1|
|      94038|         72358|        1|
|      94038|       1451889|        0|
|     113067|         75077|        0|
|      89578|         53279|        0|

I also tried an inner join on just customer_id. That made every value in the purchased column 1. I'm guessing that's because wherever a customer_id matched the purchased table, it simply attached the 1.

I also tried filtering after the left join with .where(pp['customer_id'] == ps['customer_id']), but that didn't seem to work either.

Answer 1

I created another DataFrame that contained only the customer_id values from the products purchased table. Then I joined the left-join result from above to it with an inner join. This filtered out the remaining non-purchasing customers.

# purchase_table is the DataFrame holding only customer_id from the purchased table
purchase_only_customers = left_join_table.join(purchase_table, on=["customer_id"], how="inner")
