PySpark DataFrame: getting the top 5 rows using SQL or a pandas DataFrame
I am trying to get the top 5 items for each district based on rate_increase. I am trying to use spark.sql as follows:
Input:
district item rate_increase(%)
Arba coil 500
Arba pen -85
Arba hat 50
Cebu oil -40
Cebu pen 1100
Top5item = spark.sql('select district, item , rate_increase, ROW_NUMBER() OVER (PARTITION BY district ORDER BY rate_increase DESC) AS RowNum from rateTable where rate_increase > 0')
This works. How can I filter the top 5 products in the same statement? I tried the following; is there a better way to do this through spark.sql?
Top5item = spark.sql('select district, item from (select district, item, rate_increase, ROW_NUMBER() OVER (PARTITION BY district ORDER BY rate_increase DESC) AS RowNum from rateTable where rate_increase > 0) where RowNum <= 5 order by district')
Output:
district item rate_increase(%)
Arba coil 500
Arba hat 50
Cebu pen 1100
Thanks.
Lilly, you can use pandas to read the data from a CSV file, or create a pandas DataFrame as shown below, and then convert it to a Spark DataFrame:
import pandas as pd
data_1 = {
'district': ["Arba", "Arba", "Arba","Cebu", "Cebu"],
'item': ['coil', 'pen', 'hat','oil','pen'],
'rate_increase(%)': [500,-85,50,-40,1100]}
pandas_df = pd.DataFrame(data_1)
ddf_1 = spark.createDataFrame(pandas_df)
ddf_1.createOrReplaceTempView("ddf_1")
output = spark.sql("""
select district, item , `rate_increase(%)` from (
select row_number() over (partition by district order by `rate_increase(%)` desc) as RowNum, district,item, `rate_increase(%)` from ddf_1 where `rate_increase(%)` > 0 )
where RowNum <= 5 order by district, RowNum
""")
output.show()
+--------+----+----------------+
|district|item|rate_increase(%)|
+--------+----+----------------+
| Arba|coil| 500|
| Arba| hat| 50|
| Cebu| pen| 1100|
+--------+----+----------------+
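Keep in mind the order in which a query is executed:
FROM/JOIN -> WHERE -> GROUP BY -> HAVING -> SELECT
The clause where RowNum <= 5 does not work in the same statement because WHERE is evaluated before SELECT, so it does not yet know what RowNum is.
Try using a subquery block: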
spark.sql("""
select district, item , `rate_increase(%)` from (
select row_number() over (partition by district order by `rate_increase(%)` desc) as RowNum, district,item, `rate_increase(%)` from ddf_1 where `rate_increase(%)` > 0 )
where RowNum <= 5 order by district, RowNum
""").show()
Output:
+--------+----+----------------+
|district|item|rate_increase(%)|
+--------+----+----------------+
| Arba|coil| 500|
| Arba| hat| 50|
| Cebu| pen| 1100|
+--------+----+----------------+
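The evaluation-order point can be tried without a Spark installation. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for Spark SQL (an assumption: the two engines are different, and the exact error text will differ), with a rate_increase column named as in the question. It shows a window-function alias being rejected in the same-level WHERE clause and accepted once the ranking is wrapped in a subquery.

```python
import sqlite3

# Assumption: SQLite 3.25+ (window function support) stands in for Spark SQL here.
conn = sqlite3.connect(":memory:")
conn.execute("create table rateTable (district text, item text, rate_increase integer)")
conn.executemany(
    "insert into rateTable values (?, ?, ?)",
    [
        ("Arba", "coil", 500),
        ("Arba", "pen", -85),
        ("Arba", "hat", 50),
        ("Cebu", "oil", -40),
        ("Cebu", "pen", 1100),
    ],
)

# Filtering on RowNum at the level where it is defined: WHERE runs before SELECT.
direct_filter = """
select district, item, rate_increase,
       row_number() over (partition by district order by rate_increase desc) as RowNum
from rateTable
where rate_increase > 0 and RowNum <= 5
"""
try:
    conn.execute(direct_filter).fetchall()
    direct_filter_failed = False
except sqlite3.OperationalError as exc:
    direct_filter_failed = True
    print("rejected:", exc)

# Wrapping the ranking in a subquery makes RowNum an ordinary column for the outer WHERE.
subquery = """
select district, item, rate_increase from (
    select district, item, rate_increase,
           row_number() over (partition by district order by rate_increase desc) as RowNum
    from rateTable
    where rate_increase > 0
)
where RowNum <= 5
order by district, RowNum
"""
top_rows = conn.execute(subquery).fetchall()
print(top_rows)
```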
I tried using pandas as a simple solution.
Top5item = df.sort_values('rate_increase(%)', ascending = True).groupby(['district']).head(5)
Sorting in ascending order by rate_increase(%) and then grouping by district still does not work. Thanks.
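One likely cause is the sort direction: with ascending = True, head(5) returns the five smallest values in each district rather than the largest, and negative rates are not filtered out. A minimal pandas sketch of a possible fix follows. It assumes the same five sample rows as the question, and the names positive and top5 are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "district": ["Arba", "Arba", "Arba", "Cebu", "Cebu"],
    "item": ["coil", "pen", "hat", "oil", "pen"],
    "rate_increase(%)": [500, -85, 50, -40, 1100],
})

# Keep positive increases only, mirroring the SQL filter rate_increase > 0.
positive = df[df["rate_increase(%)"] > 0]

# Sort districts alphabetically and rates from highest to lowest,
# then keep the first 5 rows of each district.
top5 = (
    positive.sort_values(["district", "rate_increase(%)"], ascending=[True, False])
            .groupby("district")
            .head(5)
            .reset_index(drop=True)
)
print(top5)
```

Because groupby(...).head(5) keeps rows in the order of the sorted frame, the sort has to put the largest rates first for head to act as a top-5 selection.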