
How can I compare columns from another dataframe in PySpark?

I have two dataframes.

The first holds the per-category averages:

+----------+-----+
| Category | AVG |
+----------+-----+
| Categ    | 1.0 |
| Categ2   | 0.5 |
+----------+-----+
...

The second has the following columns: Category, Name, Price.

The question is: how can I delete all records whose price is lower than the average price for their category in the first table?

I tried this:

dataGreaterAvge = data.where(data.Price >= avgCategoryPrice.where(data.Category == avgCategoryPrice.Category).collect()[0]["avg(Price)"])

dataGreaterAvge - the resulting dataframe
data - the second dataframe
avgCategoryPrice - the first (averages) dataframe

However, this does not work as intended: `collect()[0]` takes only the first row of the averages table, so every record is compared against a single average value instead of the average for its own category.
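To see the pitfall in isolation: `collect()` returns a plain Python list of `Row` objects, one per category, and indexing `[0]` always picks the first one. A minimal stand-in sketch using plain dicts in place of `Row` objects (the values here are made up to match the table above):

```python
# Plain-Python stand-in for what avgCategoryPrice.collect() returns:
# a list of Row-like records, one per category.
rows = [
    {"Category": "Categ", "avg(Price)": 1.0},
    {"Category": "Categ2", "avg(Price)": 0.5},
]

# Indexing [0] always picks the first category's average, regardless of
# which category each data row actually belongs to.
threshold = rows[0]["avg(Price)"]
print(threshold)  # 1.0 -- the Categ2 average (0.5) is never used
```

So the filter condition ends up constant across all rows, which is why a join on `Category` (as the answer below shows) is the right tool.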

Spark works like SQL here, so:

First you need to join the dataframes:

a = df1.alias('a')
b = df2.alias('b')
df_joined = a.join(b, a.Category == b.Category)

Then you can filter properly:

from pyspark.sql.functions import col

# Keep only the rows whose Price is at least the per-category average
df_filtered = df_joined.where(col('b.Price') >= col('a.AVG'))

