How can I compare columns from another dataframe in PySpark
I have two dataframes:
The first one contains the AVG values:
+----------+-----+
| Category | AVG |
+----------+-----+
| Categ    | 1.0 |
| Categ2   | 0.5 |
| ...      | ... |
+----------+-----+
The second one has the following columns: Category, Name, Price.
The question is: how can I delete all records whose Price is less than the average price (from the first table) for their category?
I tried it this way:
dataGreaterAvge = data.where(data.Price >= avgCategoryPrice.where(data.Category == avgCategoryPrice.Category).collect()[0]["avg(Price)"])
Here avgCategoryPrice is the first dataframe (the averages), data is the second dataframe, and dataGreaterAvge is the filtered result.
However, this does not work as it should, because it only takes the value of the first row from the averages table instead of applying each category's own average.
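For reference, the two dataframes could be reproduced with something like this (the session setup and the sample values are assumptions, only for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical data matching the shapes shown above
avgCategoryPrice = spark.createDataFrame(
    [("Categ", 1.0), ("Categ2", 0.5)], ["Category", "AVG"])
data = spark.createDataFrame(
    [("Categ", "A", 1.2), ("Categ", "B", 0.8), ("Categ2", "C", 0.4)],
    ["Category", "Name", "Price"])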
Spark works like SQL here, so first you need to join the two dataframes:
a = df1.alias('a')   # df1: the averages dataframe (avgCategoryPrice)
b = df2.alias('b')   # df2: the prices dataframe (data)
df_joined = a.join(b, a.Category == b.Category)
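Aliasing both sides matters because each dataframe has a Category column; qualifying references as a.Category and b.Category after the join avoids ambiguous-column errors.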
Then you will be able to filter properly:
from pyspark.sql.functions import col

# keep only the rows whose Price is at least their category's average
df_filtered = df_joined.where(col('b.Price') >= col('a.AVG')) \
                       .select(col('b.Category'), col('b.Name'), col('b.Price'))
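As a side note, if you would rather not keep a separate averages dataframe at all, the same filtering can be done with a window function. This is an alternative sketch, assuming the second dataframe is called data as in the question:

from pyspark.sql import functions as f
from pyspark.sql.window import Window

# compute each category's average price on the fly and keep rows at or above it
w = Window.partitionBy('Category')
result = (data
          .withColumn('catAvg', f.avg('Price').over(w))
          .where(f.col('Price') >= f.col('catAvg'))
          .drop('catAvg'))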