SQL-like window functions in PANDAS: Row Numbering in Python Pandas Dataframe
I come from a SQL background and I use the following data processing step frequently:

Ex:
df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
                   'data1' : [1,2,2,3,3],
                   'data2' : [1,10,2,3,30]})
df
data1 data2 key1
0 1 1 a
1 2 10 a
2 2 2 a
3 3 3 b
4 3 30 a
I'm looking for the pandas equivalent of this SQL window function:
RN = ROW_NUMBER() OVER (PARTITION BY Key1 ORDER BY Data1 ASC, Data2 DESC)
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
I've tried the following, which I've gotten to work where there are no 'partitions':
def row_number(frame, orderby_columns, orderby_direction, name):
    frame.sort_index(by=orderby_columns, ascending=orderby_direction, inplace=True)
    frame[name] = list(xrange(len(frame.index)))
I tried to extend this idea to work with partitions (groups in pandas), but the following didn't work:
df1 = df.groupby('key1').apply(lambda t: t.sort_index(by=['data1', 'data2'], ascending=[True, False], inplace = True)).reset_index()
def nf(x):
    x['rn'] = list(xrange(len(x.index)))
df1['rn1'] = df1.groupby('key1').apply(nf)
But I just got a lot of NaNs when I do this.

Ideally, there'd be a succinct way to replicate the window function capability of SQL (I've figured out the window-based aggregates; that's a one-liner in pandas). Can someone share the most idiomatic way to number rows like this in pandas?
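For reference, a minimal sketch (my own note, not part of the original question) of why the partitioned attempt above returns NaN: the in-place sort returns None, and nf has no return statement, so apply gets nothing back to align against the original index. Returning the sorted frame fixes it:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})

def nf(x):
    # sort_values returns a new frame (no inplace=True), and the
    # function returns it, so apply() has something to reassemble
    x = x.sort_values(['data1', 'data2'], ascending=[True, False])
    x['rn'] = range(1, len(x) + 1)
    return x

out = df.groupby('key1', group_keys=False).apply(nf)
print(out.sort_index()['rn'].tolist())  # [1, 2, 3, 1, 4]
```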
You can also use sort_values(), groupby(), and finally cumcount() + 1:
df['RN'] = df.sort_values(['data1','data2'], ascending=[True,False]) \
.groupby(['key1']) \
.cumcount() + 1
print(df)
yields:
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
P.S. Tested with pandas 0.18.
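A point worth spelling out (my addition, not from the original answer): cumcount() keeps the original index labels of the sorted copy, so the result aligns back to df's unsorted row order on assignment; no re-sort of df itself is needed:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})

# the sorted copy carries the original index labels, so the
# cumcount Series aligns by label when assigned back to df
rn = (df.sort_values(['data1', 'data2'], ascending=[True, False])
        .groupby('key1')
        .cumcount() + 1)
df['RN'] = rn
print(df['RN'].tolist())  # [1, 2, 3, 1, 4]
```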
You can do this by using groupby twice along with the rank method:
In [11]: g = df.groupby('key1')
Use the min method argument to give values which share the same data1 the same RN:
In [12]: g['data1'].rank(method='min')
Out[12]:
0 1
1 2
2 2
3 1
4 4
dtype: float64
In [13]: df['RN'] = g['data1'].rank(method='min')
And then groupby these results and add the rank with respect to data2:
In [14]: g1 = df.groupby(['key1', 'RN'])
In [15]: g1['data2'].rank(ascending=False) - 1
Out[15]:
0 0
1 0
2 1
3 0
4 0
dtype: float64
In [16]: df['RN'] += g1['data2'].rank(ascending=False) - 1
In [17]: df
Out[17]:
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
It feels like there ought to be a native way to do this (there may well be!...).
You can use transform and rank together. Here is an example:
df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
                   'C2' : [1,2,3,4,5]})
df['Rank'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.rank())
df
Have a look at the pandas rank method for more information.
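As a side note (my addition, not from the original answer), the lambda can be replaced by the method name, since rank is a built-in groupby transform:

```python
import pandas as pd

df = pd.DataFrame({'C1': ['a', 'a', 'a', 'b', 'b'],
                   'C2': [1, 2, 3, 4, 5]})

# passing the string 'rank' dispatches to the built-in groupby rank,
# avoiding the Python-level lambda call per group
df['Rank'] = df.groupby('C1')['C2'].transform('rank')
print(df['Rank'].tolist())  # [1.0, 2.0, 3.0, 1.0, 2.0]
```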
Use the groupby.rank function. Here is a working example:
df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
   C1  C2
0   a   1
1   a   2
2   a   3
3   b   4
4   b   5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
   C1  C2  RANK
0   a   1   1.0
1   a   2   2.0
2   a   3   3.0
3   b   4   1.0
4   b   5   2.0
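The original question also needs a descending secondary key (data2 DESC). One way to get that with method="first" (a sketch of my own, not part of this answer) is to sort first, since "first" breaks ties in order of appearance:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})

# sort for ORDER BY data1 ASC, data2 DESC, then let method="first"
# break data1 ties by that appearance order within each partition
df['RN'] = (df.sort_values(['data1', 'data2'], ascending=[True, False])
              .groupby('key1')['data1']
              .rank(method='first'))
print(df['RN'].tolist())  # [1.0, 2.0, 3.0, 1.0, 4.0]
```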
pandas.lib.fast_zip() can create a tuple array from a list of arrays. You can use this function to create a tuple Series and then rank it:
values = {'key1' : ['a','a','a','b','a','b'],
          'data1' : [1,2,2,3,3,3],
          'data2' : [1,10,2,3,30,20]}
df = pd.DataFrame(values, index=list("abcdef"))
def rank_multi_columns(df, cols, **kw):
    # negate columns prefixed with "-" so they rank in descending order
    data = []
    for col in cols:
        if col.startswith("-"):
            flag = -1
            col = col[1:]
        else:
            flag = 1
        data.append(flag * df[col])
    values = pd.lib.fast_zip(data)
    s = pd.Series(values, index=df.index)
    return s.rank(**kw)

rank = df.groupby("key1").apply(lambda df: rank_multi_columns(df, ["data1", "-data2"]))
print(rank)
The result:
a 1
b 2
c 3
d 2
e 4
f 1
dtype: float64
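Note that pandas.lib.fast_zip was an internal helper and has been removed from later pandas releases. A modern equivalent of this mixed-direction, multi-column row numbering (my sketch, not part of the original answer) is sort_values plus cumcount:

```python
import pandas as pd

values = {'key1': ['a', 'a', 'a', 'b', 'a', 'b'],
          'data1': [1, 2, 2, 3, 3, 3],
          'data2': [1, 10, 2, 3, 30, 20]}
df = pd.DataFrame(values, index=list("abcdef"))

# sort ascending on data1 and descending on data2, then number
# rows within each key1 partition
rank = (df.sort_values(['data1', 'data2'], ascending=[True, False])
          .groupby('key1')
          .cumcount() + 1)
print(rank.sort_index().tolist())  # [1, 2, 3, 2, 4, 1]
```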