How can I compare columns from another dataframe in PySpark
I have two dataframes:
The first one contains the AVG values:
+----------+-----+
| Category | AVG |
+----------+-----+
| Categ    | 1.0 |
| Categ2   | 0.5 |
| ...      | ... |
+----------+-----+
The second one has the following columns: Category, Name, Price.
The question is: how can I delete all records whose Price is less than the average price (from the first table) for their category?
I tried it this way:
dataGreaterAvge = data.where(data.Price >= avgCategoryPrice.where(data.Category == avgCategoryPrice.Category).collect()[0]["avg(Price)"])
Here avgCategoryPrice is the first dataframe (the averages), data is the second dataframe, and dataGreaterAvge is the filtered result.
However, this does not work as it should, because it only takes the value of the first row from the averages table instead of applying each category's own average.
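For reference, the two dataframes could be reproduced with something like this (the session setup and the sample values are assumptions, only for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical data matching the shapes shown above
avgCategoryPrice = spark.createDataFrame(
    [("Categ", 1.0), ("Categ2", 0.5)], ["Category", "AVG"])
data = spark.createDataFrame(
    [("Categ", "A", 1.2), ("Categ", "B", 0.8), ("Categ2", "C", 0.4)],
    ["Category", "Name", "Price"])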
Spark works like SQL here, so first you need to join the two dataframes:
a = df1.alias('a')   # df1: the averages dataframe (avgCategoryPrice)
b = df2.alias('b')   # df2: the prices dataframe (data)
df_joined = a.join(b, a.Category == b.Category)
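Aliasing both sides matters because each dataframe has a Category column; qualifying references as a.Category and b.Category after the join avoids ambiguous-column errors.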
Then you will be able to filter properly:
from pyspark.sql.functions import col

# keep only the rows whose Price is at least their category's average
df_filtered = df_joined.where(col('b.Price') >= col('a.AVG')) \
                       .select(col('b.Category'), col('b.Name'), col('b.Price'))
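As a side note, if you would rather not keep a separate averages dataframe at all, the same filtering can be done with a window function. This is an alternative sketch, assuming the second dataframe is called data as in the question:

from pyspark.sql import functions as f
from pyspark.sql.window import Window

# compute each category's average price on the fly and keep rows at or above it
w = Window.partitionBy('Category')
result = (data
          .withColumn('catAvg', f.avg('Price').over(w))
          .where(f.col('Price') >= f.col('catAvg'))
          .drop('catAvg'))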