
Using a PySpark join on two dataframes. I have a unique ID column and a non-unique ID column in separate DFs. How do I filter the non-unique column by the unique ID?

There are two dataframes. The first one is named products_purchased, the second one is named products_suggested.

In the products_purchased table (alias pp) there is a customer_id column (unique), an item_id column, and a purchased column (always with the value 1).

The products_suggested table (alias ps) has a customer_id column (non-unique) and an item_id column. This table has more customer_ids than the products_purchased table, as not all customers who are suggested items go on to purchase them.

I would like to join the two tables, retaining the purchased column wherever ps.customer_id (non-unique) and ps.item_id match pp.customer_id (unique) and pp.item_id. I would also like to keep any records where pp.customer_id (unique) matches ps.customer_id (non-unique).

The idea is to have a table containing only the records of customers who went on to purchase an item. The purchased item would be labeled 1 in the purchased column, and that customer's other suggested items would be labeled 0.

Product Suggested Table

+-----------+---------+
|customer_id|  item_id|
+-----------+---------+
|      16413|    51654|
|      16413|    75950|
|      16413|  1366117|
|      78450|    56107|
|      94038|    72358|
|      94038|  1451889|
|     113067|    75077|
|      89578|    53279|
+-----------+---------+

Product Purchased Table

+-----------+---------+---------+
|customer_id|  item_id|purchased|
+-----------+---------+---------+
|      16413|    75950|        1|
|      78450|    56107|        1|
|      94038|    72358|        1|
+-----------+---------+---------+

Final Table

+-----------+---------+---------+
|customer_id|  item_id|purchased|
+-----------+---------+---------+
|      16413|    51654|        0|
|      16413|    75950|        1|
|      16413|  1366117|        0|
|      78450|    56107|        1|
|      94038|    72358|        1|
|      94038|  1451889|        0|
+-----------+---------+---------+

I tried a left join on customer_id and item_id. I got what I expected: a table with all the suggested items, regardless of whether the customer purchased anything, with the purchased status attached:

final = products_suggested.join(
    products_purchased, on=["customer_id", "item_id"], how="left"
)

Final Table

+-----------+---------+---------+
|customer_id|  item_id|purchased|
+-----------+---------+---------+
|      16413|    51654|        0|
|      16413|    75950|        1|
|      16413|  1366117|        0|
|      78450|    56107|        1|
|      94038|    72358|        1|
|      94038|  1451889|        0|
|     113067|    75077|        0|
|      89578|    53279|        0|
+-----------+---------+---------+

I also tried an inner join on just customer_id. That made every value in the purchased column 1. I'm guessing that's because wherever a customer_id matched a row in the purchased table, the join simply carried over that row's 1.
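Why the inner join on customer_id alone yields only 1s can be seen in a small plain-Python model of the join, using a subset of the rows from the tables above. This is a sketch rather than PySpark itself: the lists of tuples and the helper name `inner_join_on_customer` are illustrative assumptions. Every suggested row pairs with its customer's purchase row whatever the item is, so it inherits that row's purchased value of 1.

```python
# Rows copied from the tables above (a subset).
suggested = [
    (16413, 51654),
    (16413, 75950),
    (16413, 1366117),
    (78450, 56107),
    (113067, 75077),
]
purchased = [
    (16413, 75950, 1),
    (78450, 56107, 1),
]


def inner_join_on_customer(suggested_rows, purchased_rows):
    """Model an inner join keyed on customer_id only (hypothetical helper)."""
    return [
        (s_cust, s_item, flag)
        for s_cust, s_item in suggested_rows
        for p_cust, _p_item, flag in purchased_rows
        if s_cust == p_cust  # item_id is not part of the join condition
    ]


joined = inner_join_on_customer(suggested, purchased)
for row in joined:
    print(row)
```

Customer 113067 drops out because it has no purchase row, but the unpurchased items of customer 16413 still come back with a 1, since the item never entered the comparison.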

I also tried filtering after the left join with .where(pp['customer_id'] == ps['customer_id']), but that didn't seem to work either.

I then created another dataframe that held only the customer_id column from the products_purchased table, and joined it to the left-join result from above with an inner join. This filtered out the remaining non-purchasing customers.

# purchase_table holds only the customer_id column from products_purchased
purchase_only_customers = left_join_table.join(
    purchase_table, on=["customer_id"], how="inner"
)
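The two steps described here (a left join on both keys, then keeping only purchasing customers) can be modeled in plain Python and checked against the expected Final Table. This is a sketch under stated assumptions: the helper name `label_suggestions` and the tuple layout are hypothetical, and unmatched rows are filled with 0 explicitly, because a left join in Spark leaves purchased as null for rows without a match. The comments name the PySpark DataFrame calls that would likely correspond to each step.

```python
# Rows copied from the Product Suggested and Product Purchased tables.
suggested = [
    (16413, 51654), (16413, 75950), (16413, 1366117),
    (78450, 56107),
    (94038, 72358), (94038, 1451889),
    (113067, 75077),
    (89578, 53279),
]
purchased = [
    (16413, 75950, 1),
    (78450, 56107, 1),
    (94038, 72358, 1),
]


def label_suggestions(suggested_rows, purchased_rows):
    """Model of left join + fill 0 + customer filter (hypothetical helper)."""
    # Step 1: left join on (customer_id, item_id); misses become 0.
    # Likely PySpark equivalent:
    #   ps.join(pp, on=["customer_id", "item_id"], how="left")
    #     .fillna(0, subset=["purchased"])
    flags = {(cust, item): flag for cust, item, flag in purchased_rows}
    labeled = [
        (cust, item, flags.get((cust, item), 0))
        for cust, item in suggested_rows
    ]

    # Step 2: keep only customers that appear in the purchased table.
    # Likely PySpark equivalent:
    #   .join(pp.select("customer_id").distinct(),
    #         on="customer_id", how="left_semi")
    buyers = {cust for cust, _item, _flag in purchased_rows}
    return [row for row in labeled if row[0] in buyers]


final = label_suggestions(suggested, purchased)
for row in final:
    print(row)
```

Because customer_id is unique in products_purchased, an inner join against its customer_id column (as in the snippet above) and a left_semi join keep the same rows; the semi join simply cannot duplicate rows if that uniqueness ever stops holding.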
