PySpark DataFrame: getting the top 5 rows per group using SQL or a pandas DataFrame

Question

I am trying to get the top 5 items for each district based on rate_increase. I am trying to use spark.sql as follows:

Input:

   district   item   rate_increase(%)
     Arba     coil    500
     Arba     pen    -85
     Arba     hat     50
     Cebu     oil    -40
     Cebu     pen     1100

Top5item = spark.sql('select district, item , rate_increase, ROW_NUMBER() OVER (PARTITION BY district ORDER BY rate_increase DESC) AS RowNum from rateTable where rate_increase > 0')

This works. How can I filter the top 5 products in the same statement? I tried the following; is there a better way to do this through spark.sql?

Top5item = spark.sql('select district, item from (select district, item, rate_increase, ROW_NUMBER() OVER (PARTITION BY district ORDER BY rate_increase DESC) AS RowNum from rateTable where rate_increase > 0) where RowNum <= 5 order by district')

Output:

   district   item   rate_increase(%)
     Arba     coil    500
     Arba     hat     50
     Cebu     pen     1100

Thanks.

Answer 1

Lilly, you can use pandas to read the data from a CSV file, or create a pandas DataFrame as shown below, and then convert it to a Spark DataFrame:

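import pandas as pd
from pyspark.sql import SparkSession

# Reuse the active session (or create one) so that `spark` is defined
spark = SparkSession.builder.getOrCreate()

data_1 = {
    'district': ["Arba", "Arba", "Arba", "Cebu", "Cebu"],
    'item': ['coil', 'pen', 'hat', 'oil', 'pen'],
    'rate_increase(%)': [500, -85, 50, -40, 1100]}
pandas_df = pd.DataFrame(data_1)
# Convert to a Spark DataFrame and register it as a temp view for spark.sql
ddf_1 = spark.createDataFrame(pandas_df)
ddf_1.createOrReplaceTempView("ddf_1")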

output = spark.sql("""

select district, item , `rate_increase(%)` from (
  select row_number() over (partition by district order by `rate_increase(%)` desc) as RowNum, district,item, `rate_increase(%)`  from ddf_1  where  `rate_increase(%)` > 0 )
where RowNum <= 5 order by district, RowNum

""")

output.show()

+--------+----+----------------+
|district|item|rate_increase(%)|
+--------+----+----------------+
|    Arba|coil|             500|
|    Arba| hat|              50|
|    Cebu| pen|            1100|
+--------+----+----------------+

Answer 2

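Keep in mind the order in which a query is evaluated:

FROM/JOIN -> WHERE -> GROUP BY -> HAVING -> SELECT

Adding where RowNum <= 5 to the same SELECT does not work: WHERE is evaluated before SELECT, so the alias RowNum is not yet defined at that point.

Try wrapping the query in a subquery: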

spark.sql("""

select district, item , `rate_increase(%)` from (
  select row_number() over (partition by district order by `rate_increase(%)` desc) as RowNum, district,item, `rate_increase(%)`  from ddf_1  where  `rate_increase(%)` > 0 )
where RowNum <= 5 order by district, RowNum

""").show()

Output:

+--------+----+----------------+
|district|item|rate_increase(%)|
+--------+----+----------------+
|    Arba|coil|             500|
|    Arba| hat|              50|
|    Cebu| pen|            1100|
+--------+----+----------------+
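
The subquery pattern can be tried without a Spark cluster. The sketch below uses Python's built-in sqlite3 module rather than Spark, so it only illustrates the SQL shape; it assumes a SQLite build with window-function support (3.25 or later), and the table name rate_table and the column name rate_increase are simplified stand-ins for the names used above.

```python
import sqlite3

# In-memory database with the sample rows from the question
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rate_table (district TEXT, item TEXT, rate_increase INTEGER)")
conn.executemany(
    "INSERT INTO rate_table VALUES (?, ?, ?)",
    [("Arba", "coil", 500), ("Arba", "pen", -85), ("Arba", "hat", 50),
     ("Cebu", "oil", -40), ("Cebu", "pen", 1100)],
)

# The inner query computes row_num; the outer WHERE can see it because
# the subquery has already been evaluated by the time the outer filter runs.
query = """
SELECT district, item, rate_increase
FROM (
    SELECT district, item, rate_increase,
           ROW_NUMBER() OVER (PARTITION BY district ORDER BY rate_increase DESC) AS row_num
    FROM rate_table
    WHERE rate_increase > 0
) AS ranked
WHERE row_num <= 5
ORDER BY district, row_num
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```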

Answer 3

I tried using pandas as a simple solution.

Top5item = df.sort_values('rate_increase(%)', ascending = True).groupby(['district']).head(5)

Ordering by rate_increase(%) after grouping by district still does not work, though. Thanks.
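
One possible reason the pandas attempt misbehaves: with ascending=True the smallest values come first, so head(5) keeps the five lowest rows per district, and groupby(...).head() returns rows in the frame's existing order rather than grouped by district. A hedged sketch of one way around this, assuming the same sample data as the question, is to sort by district and by descending rate before taking the head of each group:

```python
import pandas as pd

df = pd.DataFrame({
    "district": ["Arba", "Arba", "Arba", "Cebu", "Cebu"],
    "item": ["coil", "pen", "hat", "oil", "pen"],
    "rate_increase(%)": [500, -85, 50, -40, 1100],
})

top5 = (
    df[df["rate_increase(%)"] > 0]                      # positive increases only
    .sort_values(["district", "rate_increase(%)"],
                 ascending=[True, False])               # district A-Z, highest rate first
    .groupby("district")
    .head(5)                                            # first 5 rows of each district
    .reset_index(drop=True)
)

print(top5)
```

If a Spark DataFrame is needed afterwards, the result could be passed to spark.createDataFrame as in Answer 1.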

3 answers

Answer 1
1 2020-02-06 17:15:10

Answer 2
0 2020-02-05 11:00:16

Answer 3
0 accepted 2020-02-05 13:36:02
