SQL to pandas: DENSE_RANK() OVER (PARTITION BY )
I am trying to translate the following piece of SQL code to a pandas equivalent:
SELECT
t.company,
t.topic,
t.statement
FROM
(
SELECT
e.company,
e.topic,
e.probability,
e.distance,
LOWER(e.statement) AS statement,
dense_rank() OVER (PARTITION BY e.company,e.topic ORDER BY e.distance DESC) as rank
FROM
esg.group_dist e
) t
WHERE
t.rank = 1
AND t.topic IN ('green energy')
ORDER BY
company,
topic,
rank
I got as far as:
esg_group_dist['rank'] = esg_group_dist[['company', 'topic', 'probability', 'distance', 'sentence']] \
.sort_values(by=['distance']) \
.groupby(['company', 'topic']) \
I found the following SO thread that should contain a solution, but I can't manage to implement it successfully for my use case.
Thanks!
There is groupby.rank:
esg_group_dist['rank'] = (esg_group_dist.groupby(['company', 'topic'])
                          ['distance'].rank(method='dense', ascending=False)
                          )
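For reference, here is a minimal sketch of how the whole query might be put together with groupby.rank. The sample rows are made up for illustration, and the text column is assumed to be called statement as in the SQL (the pandas attempt in the question calls it sentence), so adjust the names to your frame.

```python
import pandas as pd

# Hypothetical sample data; column names follow the SQL query in the question.
esg_group_dist = pd.DataFrame({
    "company":     ["A", "A", "A", "B", "B"],
    "topic":       ["green energy"] * 4 + ["waste"],
    "probability": [0.9, 0.8, 0.7, 0.6, 0.5],
    "distance":    [0.2, 0.5, 0.5, 0.3, 0.9],
    "statement":   ["Solar First", "Wind Power", "Hydro Plan", "Net Zero", "Recycling"],
})

# DENSE_RANK() OVER (PARTITION BY company, topic ORDER BY distance DESC)
esg_group_dist["rank"] = (
    esg_group_dist.groupby(["company", "topic"])["distance"]
    .rank(method="dense", ascending=False)
)

# WHERE rank = 1 AND topic IN ('green energy'), then LOWER(), ORDER BY and SELECT
result = (
    esg_group_dist[
        (esg_group_dist["rank"] == 1)
        & esg_group_dist["topic"].isin(["green energy"])
    ]
    .assign(statement=lambda d: d["statement"].str.lower())
    .sort_values(["company", "topic", "rank"])
    [["company", "topic", "statement"]]
    .reset_index(drop=True)
)
print(result)
```

Note that rank returns floats, so the comparison with 1 matches 1.0, and rows tied for the largest distance all get rank 1, just as with DENSE_RANK in SQL.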
However, looking at your entire query, it looks like you're trying to extract the rows where distance is maximum within each group. You can do so faster with:
(esg_group_dist[['company', 'topic', 'probability', 'distance', 'sentence']]
 .sort_values('distance') # sort by distance, ascending
 .drop_duplicates(['company','topic'], keep='last') # keep the last row per group, i.e. the maximum distance
 .query('topic=="green energy"') # filter topic
)
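One caveat worth checking against your data: the two approaches differ when several rows in a group share the maximum distance. A dense rank of 1 keeps all tied rows, while sort_values plus drop_duplicates keeps exactly one row per group. The small sketch below, on made-up data, illustrates the difference.

```python
import pandas as pd

# Hypothetical data with a tie for the maximum distance inside one group.
df = pd.DataFrame({
    "company":  ["A", "A", "A"],
    "topic":    ["green energy"] * 3,
    "distance": [0.2, 0.5, 0.5],
})

# Dense rank: every row tied for the maximum distance gets rank 1.
ranks = df.groupby(["company", "topic"])["distance"].rank(method="dense", ascending=False)
by_rank = df[ranks == 1]

# sort + drop_duplicates: exactly one row survives per (company, topic).
by_dedup = df.sort_values("distance").drop_duplicates(["company", "topic"], keep="last")

print(len(by_rank), len(by_dedup))
```

If ties cannot occur in your data, or any one of the tied rows is acceptable, the faster version is fine.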
Note: to find the minimum rows instead, remove ascending=False (in the rank call) and keep='last' (in the drop_duplicates call). There is also the groupby().idxmin/idxmax() option.
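A minimal sketch of the idxmax variant, again on made-up rows and assuming the frame has a unique index: idxmax returns one index label per group, which can then be passed to loc.

```python
import pandas as pd

# Hypothetical sample data; 'sentence' follows the column name used in the question's pandas code.
df = pd.DataFrame({
    "company":  ["A", "A", "B", "B"],
    "topic":    ["green energy", "green energy", "green energy", "waste"],
    "distance": [0.2, 0.5, 0.3, 0.9],
    "sentence": ["s1", "s2", "s3", "s4"],
})

# For each (company, topic) group, get the index label of the row with the largest distance.
idx = df.groupby(["company", "topic"])["distance"].idxmax()

# Select those rows, then filter on topic.
top = df.loc[idx].query('topic == "green energy"')
print(top)
```

Like drop_duplicates, this returns a single row per group; swap idxmax for idxmin to get the rows with the smallest distance.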