
Filter out rows of a Spark dataframe based on a count condition on a specific value in a column [spark.sql syntax in pyspark]

I have the following Spark dataframe:

import pandas as pd  # pandas is used only to build the toy dataset

datalake_spark_dataframe_downsampled = pd.DataFrame( 
                           {'id' : ['001', '001', '001', '001', '001', '002', '002', '002'],
                            'OuterSensorConnected':[0, 0, 0, 1, 0, 0, 0, 1], 
                            'OuterHumidity':[31.784826, 32.784826, 33.784826, 43.784826, 23.784826, 54.784826, 31.784826, 31.784826],
                            'EnergyConsumption': [70, 70, 70, 70, 70, 70, 70, 70],
                            'DaysDeploymentDate': [10, 20, 21, 31, 41, 11, 19, 57],
                            'label': [0, 0, 1, 1, 1, 0, 0, 1]}
                           )
datalake_spark_dataframe_downsampled = spark.createDataFrame(datalake_spark_dataframe_downsampled)

# printSchema of the datalake_spark_dataframe_downsampled (spark df):

"root
 |-- id: string (nullable = true)
 |-- OuterSensorConnected: integer (nullable = false)
 |-- OuterHumidity: float (nullable = true)
 |-- EnergyConsumption: float (nullable = true)
 |-- DaysDeploymentDate: integer (nullable = true)
 |-- label: integer (nullable = false)"

As you can see, for the first id '001' I have 5 rows and for the second id '002' I have 3 rows. What I want is to filter out the rows belonging to ids whose total count of positive labels ('1') is less than 2. Since the first id '001' has 3 rows with positive label 1, while the second id '002' has only 1 such row, I want all the rows related to the id '002' to be filtered out. So my final df would look like:

datalake_spark_dataframe_downsampled_filtered = pd.DataFrame( 
                           {'id' : ['001', '001', '001', '001', '001'],
                            'OuterSensorConnected':[0, 0, 0, 1, 0], 
                            'OuterHumidity':[31.784826, 32.784826, 33.784826, 43.784826, 23.784826],
                            'EnergyConsumption': [70, 70, 70, 70, 70],
                            'DaysDeploymentDate': [10, 20, 21, 31, 41],
                            'label': [0, 0, 1, 1, 1]}
                           )
datalake_spark_dataframe_downsampled_filtered = spark.createDataFrame(datalake_spark_dataframe_downsampled_filtered)

How is this achievable with a spark.sql() query? Something like:

datalake_spark_dataframe_downsampled.createOrReplaceTempView("df_filtered")

spark_dataset_filtered = spark.sql("""SELECT *, count(label) AS counted_label FROM df_filtered GROUP BY id HAVING counted_label >= 2""")  # how to only count the positive values here?
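
One way to count only the positive labels, as the inline comment asks, is a conditional aggregation per id joined back to the original rows. This is only a minimal sketch, assuming the temp view df_filtered is registered on the original datalake_spark_dataframe_downsampled:

# Sketch: keep the ids that have at least two rows with label == 1,
# then join back so every row of a qualifying id survives.
spark_dataset_filtered = spark.sql("""
    SELECT t.*
    FROM df_filtered t
    JOIN (
        SELECT id
        FROM df_filtered
        GROUP BY id
        HAVING SUM(CASE WHEN label = 1 THEN 1 ELSE 0 END) >= 2
    ) keep_ids
    ON t.id = keep_ids.id
""")
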

How about using a window:

datalake_spark_dataframe_downsampled.createOrReplaceTempView("df_filtered")

spark.sql("""select * from (select *, sum(label) over (partition by id) as Sum_l
                      from df_filtered) where Sum_l >= 2""").drop("Sum_l").show()

+---+--------------------+-------------+-----------------+------------------+-----+
| id|OuterSensorConnected|OuterHumidity|EnergyConsumption|DaysDeploymentDate|label|
+---+--------------------+-------------+-----------------+------------------+-----+
|001|                   0|    31.784826|               70|                10|    0|
|001|                   0|    32.784826|               70|                20|    0|
|001|                   0|    33.784826|               70|                21|    1|
|001|                   1|    43.784826|               70|                31|    1|
|001|                   0|    23.784826|               70|                41|    1|
+---+--------------------+-------------+-----------------+------------------+-----+
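
For completeness, the same window logic can also be expressed with the PySpark DataFrame API instead of spark.sql; this is a minimal sketch using pyspark.sql.functions and Window, reusing the variable name from the example above:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Per-id sum of the 0/1 label column; because label is binary, this equals
# the number of positive rows for each id.
w = Window.partitionBy("id")

filtered = (datalake_spark_dataframe_downsampled
            .withColumn("Sum_l", F.sum("label").over(w))
            .filter(F.col("Sum_l") >= 2)
            .drop("Sum_l"))

filtered.show()

Both forms keep the same five '001' rows shown in the output above.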
