[英]Sort by column and append counter using pandas
I have a large data frame (>1m rows, 10+ cols) that I need to do the following to:我有一个大数据框(>1m 行,10+ 列),我需要执行以下操作:
A
& B
in example)按两列分组(例如A
& B
)C
in example)按另一列在分组内排序(例如C
)C
( E
in example) Append 另一个列的增量计数器,按C
的排序值的顺序递增(例如E
)D
in example)保留其他列未编辑(例如D
)I have got the following code working, which gives the correct results.我有以下代码工作,它给出了正确的结果。 However, it is very slow.但是,它非常缓慢。 Can anyone suggest some pandas magic to get improved performance?谁能建议一些 pandas 魔术来提高性能?
import pandas as pd
import numpy as np
np.random.seed(0)
A = list()
B = list()
C = list()
D = list()
E = list()
np_alphabet = np.array(list('ABCEEFGHIJKLMNOPQRSTUVWXYZ'), dtype="|S1")
np_codes = np.random.choice(np_alphabet, [5, 10])
for a in np_codes:
for b in range(2):
for i in range(5):
A.append(''.join(a))
B.append('{}_{}'.format(b, A[-1]))
C.append(np.random.rand())
D.append(i)
E.append(B[-1])
df = pd.DataFrame({
'A': A,
'B': B,
'C': C,
'D': D,
'E': E
})
df.set_index(['A', 'B'], drop=False, inplace=True)
df.sort_index(inplace=True)
print(df)
grouped_sizes = df.groupby(level=[0, 1]).size()
num_indices = grouped_sizes.shape[0]
print_num = max(1, num_indices // 20)
for idx in grouped_sizes.index:
if grouped_sizes[idx] > 1:
tmp_df = df.loc[idx].sort_values('C', inplace=False)
tmp_df['E'] = map(lambda x: '{}_{}'.format(*x), zip(range(1, tmp_df.size + 1), tmp_df['E']))
df.loc[idx] = tmp_df
else:
df.loc[idx, 'E'] = '1_{}'.format(df.loc[idx, 'E'])
print(df)
which gives the following output这给出了以下 output
# before
A B C D E
A B
AXEJWVIZMZ 0_AXEJWVIZMZ AXEJWVIZMZ 0_AXEJWVIZMZ 0.954914 0 0_AXEJWVIZMZ
0_AXEJWVIZMZ AXEJWVIZMZ 0_AXEJWVIZMZ 0.758615 1 0_AXEJWVIZMZ
0_AXEJWVIZMZ AXEJWVIZMZ 0_AXEJWVIZMZ 0.952573 2 0_AXEJWVIZMZ
0_AXEJWVIZMZ AXEJWVIZMZ 0_AXEJWVIZMZ 0.903142 3 0_AXEJWVIZMZ
0_AXEJWVIZMZ AXEJWVIZMZ 0_AXEJWVIZMZ 0.154262 4 0_AXEJWVIZMZ
1_AXEJWVIZMZ AXEJWVIZMZ 1_AXEJWVIZMZ 0.560586 0 1_AXEJWVIZMZ
1_AXEJWVIZMZ AXEJWVIZMZ 1_AXEJWVIZMZ 0.528869 1 1_AXEJWVIZMZ
1_AXEJWVIZMZ AXEJWVIZMZ 1_AXEJWVIZMZ 0.115331 2 1_AXEJWVIZMZ
1_AXEJWVIZMZ AXEJWVIZMZ 1_AXEJWVIZMZ 0.380718 3 1_AXEJWVIZMZ
1_AXEJWVIZMZ AXEJWVIZMZ 1_AXEJWVIZMZ 0.147092 4 1_AXEJWVIZMZ
BVBSPSACVA 0_BVBSPSACVA BVBSPSACVA 0_BVBSPSACVA 0.824997 0 0_BVBSPSACVA
0_BVBSPSACVA BVBSPSACVA 0_BVBSPSACVA 0.264456 1 0_BVBSPSACVA
0_BVBSPSACVA BVBSPSACVA 0_BVBSPSACVA 0.282663 2 0_BVBSPSACVA
0_BVBSPSACVA BVBSPSACVA 0_BVBSPSACVA 0.678287 3 0_BVBSPSACVA
0_BVBSPSACVA BVBSPSACVA 0_BVBSPSACVA 0.409996 4 0_BVBSPSACVA
1_BVBSPSACVA BVBSPSACVA 1_BVBSPSACVA 0.149984 0 1_BVBSPSACVA
1_BVBSPSACVA BVBSPSACVA 1_BVBSPSACVA 0.711210 1 1_BVBSPSACVA
1_BVBSPSACVA BVBSPSACVA 1_BVBSPSACVA 0.840399 2 1_BVBSPSACVA
1_BVBSPSACVA BVBSPSACVA 1_BVBSPSACVA 0.804939 3 1_BVBSPSACVA
1_BVBSPSACVA BVBSPSACVA 1_BVBSPSACVA 0.290150 4 1_BVBSPSACVA
FHNMIRQRSP 0_FHNMIRQRSP FHNMIRQRSP 0_FHNMIRQRSP 0.119058 0 0_FHNMIRQRSP
0_FHNMIRQRSP FHNMIRQRSP 0_FHNMIRQRSP 0.021955 1 0_FHNMIRQRSP
0_FHNMIRQRSP FHNMIRQRSP 0_FHNMIRQRSP 0.299527 2 0_FHNMIRQRSP
0_FHNMIRQRSP FHNMIRQRSP 0_FHNMIRQRSP 0.449371 3 0_FHNMIRQRSP
0_FHNMIRQRSP FHNMIRQRSP 0_FHNMIRQRSP 0.179845 4 0_FHNMIRQRSP
1_FHNMIRQRSP FHNMIRQRSP 1_FHNMIRQRSP 0.075765 0 1_FHNMIRQRSP
1_FHNMIRQRSP FHNMIRQRSP 1_FHNMIRQRSP 0.413373 1 1_FHNMIRQRSP
1_FHNMIRQRSP FHNMIRQRSP 1_FHNMIRQRSP 0.835250 2 1_FHNMIRQRSP
1_FHNMIRQRSP FHNMIRQRSP 1_FHNMIRQRSP 0.371984 3 1_FHNMIRQRSP
1_FHNMIRQRSP FHNMIRQRSP 1_FHNMIRQRSP 0.265494 4 1_FHNMIRQRSP
TJSECSLWFT 0_TJSECSLWFT TJSECSLWFT 0_TJSECSLWFT 0.804553 0 0_TJSECSLWFT
0_TJSECSLWFT TJSECSLWFT 0_TJSECSLWFT 0.376646 1 0_TJSECSLWFT
0_TJSECSLWFT TJSECSLWFT 0_TJSECSLWFT 0.904908 2 0_TJSECSLWFT
0_TJSECSLWFT TJSECSLWFT 0_TJSECSLWFT 0.274501 3 0_TJSECSLWFT
0_TJSECSLWFT TJSECSLWFT 0_TJSECSLWFT 0.820866 4 0_TJSECSLWFT
1_TJSECSLWFT TJSECSLWFT 1_TJSECSLWFT 0.886687 0 1_TJSECSLWFT
1_TJSECSLWFT TJSECSLWFT 1_TJSECSLWFT 0.198887 1 1_TJSECSLWFT
1_TJSECSLWFT TJSECSLWFT 1_TJSECSLWFT 0.857795 2 1_TJSECSLWFT
1_TJSECSLWFT TJSECSLWFT 1_TJSECSLWFT 0.326926 3 1_TJSECSLWFT
1_TJSECSLWFT TJSECSLWFT 1_TJSECSLWFT 0.116743 4 1_TJSECSLWFT
WXEKPQSLQK 0_WXEKPQSLQK WXEKPQSLQK 0_WXEKPQSLQK 0.249891 0 0_WXEKPQSLQK
0_WXEKPQSLQK WXEKPQSLQK 0_WXEKPQSLQK 0.945414 1 0_WXEKPQSLQK
0_WXEKPQSLQK WXEKPQSLQK 0_WXEKPQSLQK 0.235062 2 0_WXEKPQSLQK
0_WXEKPQSLQK WXEKPQSLQK 0_WXEKPQSLQK 0.082703 3 0_WXEKPQSLQK
0_WXEKPQSLQK WXEKPQSLQK 0_WXEKPQSLQK 0.894169 4 0_WXEKPQSLQK
1_WXEKPQSLQK WXEKPQSLQK 1_WXEKPQSLQK 0.595575 0 1_WXEKPQSLQK
1_WXEKPQSLQK WXEKPQSLQK 1_WXEKPQSLQK 0.769144 1 1_WXEKPQSLQK
1_WXEKPQSLQK WXEKPQSLQK 1_WXEKPQSLQK 0.917691 2 1_WXEKPQSLQK
1_WXEKPQSLQK WXEKPQSLQK 1_WXEKPQSLQK 0.567448 3 1_WXEKPQSLQK
1_WXEKPQSLQK WXEKPQSLQK 1_WXEKPQSLQK 0.832299 4 1_WXEKPQSLQK
and和
# after
A B C D E
A B
AXEJWVIZMZ 0_AXEJWVIZMZ AXEJWVIZMZ 0_AXEJWVIZMZ 0.154262 4 1_0_AXEJWVIZMZ
0_AXEJWVIZMZ AXEJWVIZMZ 0_AXEJWVIZMZ 0.758615 1 2_0_AXEJWVIZMZ
0_AXEJWVIZMZ AXEJWVIZMZ 0_AXEJWVIZMZ 0.903142 3 3_0_AXEJWVIZMZ
0_AXEJWVIZMZ AXEJWVIZMZ 0_AXEJWVIZMZ 0.952573 2 4_0_AXEJWVIZMZ
0_AXEJWVIZMZ AXEJWVIZMZ 0_AXEJWVIZMZ 0.954914 0 5_0_AXEJWVIZMZ
1_AXEJWVIZMZ AXEJWVIZMZ 1_AXEJWVIZMZ 0.115331 2 1_1_AXEJWVIZMZ
1_AXEJWVIZMZ AXEJWVIZMZ 1_AXEJWVIZMZ 0.147092 4 2_1_AXEJWVIZMZ
1_AXEJWVIZMZ AXEJWVIZMZ 1_AXEJWVIZMZ 0.380718 3 3_1_AXEJWVIZMZ
1_AXEJWVIZMZ AXEJWVIZMZ 1_AXEJWVIZMZ 0.528869 1 4_1_AXEJWVIZMZ
1_AXEJWVIZMZ AXEJWVIZMZ 1_AXEJWVIZMZ 0.560586 0 5_1_AXEJWVIZMZ
BVBSPSACVA 0_BVBSPSACVA BVBSPSACVA 0_BVBSPSACVA 0.264456 1 1_0_BVBSPSACVA
0_BVBSPSACVA BVBSPSACVA 0_BVBSPSACVA 0.282663 2 2_0_BVBSPSACVA
0_BVBSPSACVA BVBSPSACVA 0_BVBSPSACVA 0.409996 4 3_0_BVBSPSACVA
0_BVBSPSACVA BVBSPSACVA 0_BVBSPSACVA 0.678287 3 4_0_BVBSPSACVA
0_BVBSPSACVA BVBSPSACVA 0_BVBSPSACVA 0.824997 0 5_0_BVBSPSACVA
1_BVBSPSACVA BVBSPSACVA 1_BVBSPSACVA 0.149984 0 1_1_BVBSPSACVA
1_BVBSPSACVA BVBSPSACVA 1_BVBSPSACVA 0.290150 4 2_1_BVBSPSACVA
1_BVBSPSACVA BVBSPSACVA 1_BVBSPSACVA 0.711210 1 3_1_BVBSPSACVA
1_BVBSPSACVA BVBSPSACVA 1_BVBSPSACVA 0.804939 3 4_1_BVBSPSACVA
1_BVBSPSACVA BVBSPSACVA 1_BVBSPSACVA 0.840399 2 5_1_BVBSPSACVA
FHNMIRQRSP 0_FHNMIRQRSP FHNMIRQRSP 0_FHNMIRQRSP 0.021955 1 1_0_FHNMIRQRSP
0_FHNMIRQRSP FHNMIRQRSP 0_FHNMIRQRSP 0.119058 0 2_0_FHNMIRQRSP
0_FHNMIRQRSP FHNMIRQRSP 0_FHNMIRQRSP 0.179845 4 3_0_FHNMIRQRSP
0_FHNMIRQRSP FHNMIRQRSP 0_FHNMIRQRSP 0.299527 2 4_0_FHNMIRQRSP
0_FHNMIRQRSP FHNMIRQRSP 0_FHNMIRQRSP 0.449371 3 5_0_FHNMIRQRSP
1_FHNMIRQRSP FHNMIRQRSP 1_FHNMIRQRSP 0.075765 0 1_1_FHNMIRQRSP
1_FHNMIRQRSP FHNMIRQRSP 1_FHNMIRQRSP 0.265494 4 2_1_FHNMIRQRSP
1_FHNMIRQRSP FHNMIRQRSP 1_FHNMIRQRSP 0.371984 3 3_1_FHNMIRQRSP
1_FHNMIRQRSP FHNMIRQRSP 1_FHNMIRQRSP 0.413373 1 4_1_FHNMIRQRSP
1_FHNMIRQRSP FHNMIRQRSP 1_FHNMIRQRSP 0.835250 2 5_1_FHNMIRQRSP
TJSECSLWFT 0_TJSECSLWFT TJSECSLWFT 0_TJSECSLWFT 0.274501 3 1_0_TJSECSLWFT
0_TJSECSLWFT TJSECSLWFT 0_TJSECSLWFT 0.376646 1 2_0_TJSECSLWFT
0_TJSECSLWFT TJSECSLWFT 0_TJSECSLWFT 0.804553 0 3_0_TJSECSLWFT
0_TJSECSLWFT TJSECSLWFT 0_TJSECSLWFT 0.820866 4 4_0_TJSECSLWFT
0_TJSECSLWFT TJSECSLWFT 0_TJSECSLWFT 0.904908 2 5_0_TJSECSLWFT
1_TJSECSLWFT TJSECSLWFT 1_TJSECSLWFT 0.116743 4 1_1_TJSECSLWFT
1_TJSECSLWFT TJSECSLWFT 1_TJSECSLWFT 0.198887 1 2_1_TJSECSLWFT
1_TJSECSLWFT TJSECSLWFT 1_TJSECSLWFT 0.326926 3 3_1_TJSECSLWFT
1_TJSECSLWFT TJSECSLWFT 1_TJSECSLWFT 0.857795 2 4_1_TJSECSLWFT
1_TJSECSLWFT TJSECSLWFT 1_TJSECSLWFT 0.886687 0 5_1_TJSECSLWFT
WXEKPQSLQK 0_WXEKPQSLQK WXEKPQSLQK 0_WXEKPQSLQK 0.082703 3 1_0_WXEKPQSLQK
0_WXEKPQSLQK WXEKPQSLQK 0_WXEKPQSLQK 0.235062 2 2_0_WXEKPQSLQK
0_WXEKPQSLQK WXEKPQSLQK 0_WXEKPQSLQK 0.249891 0 3_0_WXEKPQSLQK
0_WXEKPQSLQK WXEKPQSLQK 0_WXEKPQSLQK 0.894169 4 4_0_WXEKPQSLQK
0_WXEKPQSLQK WXEKPQSLQK 0_WXEKPQSLQK 0.945414 1 5_0_WXEKPQSLQK
1_WXEKPQSLQK WXEKPQSLQK 1_WXEKPQSLQK 0.567448 3 1_1_WXEKPQSLQK
1_WXEKPQSLQK WXEKPQSLQK 1_WXEKPQSLQK 0.595575 0 2_1_WXEKPQSLQK
1_WXEKPQSLQK WXEKPQSLQK 1_WXEKPQSLQK 0.769144 1 3_1_WXEKPQSLQK
1_WXEKPQSLQK WXEKPQSLQK 1_WXEKPQSLQK 0.832299 4 4_1_WXEKPQSLQK
1_WXEKPQSLQK WXEKPQSLQK 1_WXEKPQSLQK 0.917691 2 5_1_WXEKPQSLQK
from __future__ import print_function, division
from timeit import Timer
import pandas as pd
import numpy as np
def create_df():
np.random.seed(0)
A = list()
B = list()
C = list()
D = list()
E = list()
np_alphabet = np.array(list('ABCEEFGHIJKLMNOPQRSTUVWXYZ'), dtype="|S1")
np_codes = np.random.choice(np_alphabet, [100, 10])
for a in np_codes:
for b in range(2):
for i in range(5):
A.append(''.join(a))
B.append('{}_{}'.format(b, A[-1]))
C.append(np.random.rand())
D.append(i)
E.append(B[-1])
df = pd.DataFrame({
'A': A,
'B': B,
'C': C,
'D': D,
'E': E
})
return df.copy()
def method1(df):
df.set_index(['A', 'B'], drop=False, inplace=True)
df.sort_index(inplace=True)
grouped_sizes = df.groupby(level=[0, 1]).size()
for idx in grouped_sizes.index:
if grouped_sizes[idx] > 1:
tmp_df = df.loc[idx].sort_values('C', inplace=False)
tmp_df['E'] = map(lambda x: '{}_{}'.format(*x), zip(range(1, tmp_df.size + 1), tmp_df['E']))
df.loc[idx] = tmp_df
else:
df.loc[idx, 'E'] = '1_{}'.format(df.loc[idx, 'E'])
return df
def method1a(df):
df.set_index(['A', 'B'], drop=False, inplace=True)
df.sort_values(['A', 'B', 'C'], inplace=True)
grouped_sizes = df.groupby(level=[0, 1]).size()
for idx in grouped_sizes.index:
if grouped_sizes[idx] > 1:
df.loc[idx, 'E'] = map(lambda x: '{}_{}'.format(*x), zip(range(1, grouped_sizes[idx] + 1), df.loc[idx, 'E']))
else:
df.loc[idx, 'E'] = '1_{}'.format(df.loc[idx, 'E'])
return df
def method2(df):
df.set_index(['A', 'B'], drop=False, inplace=True)
df['F'] = 0
df.sort_values(['A', 'B', 'C'], inplace=True)
grouped_sizes = df.groupby(level=[0, 1]).size()
for idx in grouped_sizes.index:
if grouped_sizes[idx] > 1:
df.loc[idx, 'F'] = range(1, grouped_sizes[idx] + 1)
else:
df.loc[idx, 'F'] = 1
df['E'] = df['F'].map(str) + '_' + df['E']
df.drop('F', axis=1, inplace=True)
return df
def method3(df):
df.set_index(['A', 'B'], drop=False, inplace=True)
df['F'] = 0
df.sort_values(['A', 'B', 'C'], inplace=True)
grouped_sizes = df.groupby(level=[0, 1]).size()
for idx in grouped_sizes.index:
if grouped_sizes[idx] > 1:
df.loc[idx, 'F'] = map(str, range(1, grouped_sizes[idx] + 1))
else:
df.loc[idx, 'F'] = '1'
df['E'] = df['F'] + '_' + df['E']
df.drop('F', axis=1, inplace=True)
return df
def method4(df):
prefixes = df.groupby(['A', 'B']).C.apply(pd.Series.argsort).add(1).astype(str)
df['E'] = prefixes + '_' + df.E
return df
def method5(df):
df.set_index(['A', 'B'], drop=False, inplace=True)
df.sort_values(['A', 'B', 'C'], inplace=True)
df['E'] = df.groupby(level=[0, 1]).cumcount().add(1).astype(str) + '_' + df['E']
return df
def assert_success(df):
row = df[(df['A'] == 'AEVFGIURPE') & (df['B'] == '0_AEVFGIURPE') & (df['D'] == 2)].iloc[0]
if not np.allclose(row['C'], 0.381397) or row['E'] != '3_0_AEVFGIURPE':
print('A: method{}() failed: {} != 0.871083 or {} != 5_1_XOYRFZNIJU'.format(func, row['C'], row['E']))
return
row = df[(df['A'] == 'XOYRFZNIJU') & (df['B'] == '1_XOYRFZNIJU') & (df['D'] == 1)].iloc[0]
if not np.allclose(row['C'], 0.871083) or row['E'] != '5_1_XOYRFZNIJU':
print('B: method{}() failed: {} != 0.871083 or {} != 5_1_XOYRFZNIJU'.format(func, row['C'], row['E']))
return
functions = list()
functions.append('1')
functions.append('1a')
functions.append('2')
functions.append('3')
functions.append('4')
functions.append('5')
for func in functions:
print('method{}'.format(func),
Timer(setup='from __main__ import create_df, assert_success, method{} as func'.format(func),
stmt='df = create_df(); df = func(df); assert_success(df)').repeat(number=10))
Which gives the following results:这给出了以下结果:
method1 [6.581194877624512, 6.625822067260742, 6.722187042236328]
method1a [1.9003210067749023, 1.9387969970703125, 1.9142169952392578]
method2 [0.9547598361968994, 0.9532740116119385, 0.9760739803314209]
method3 [1.0121638774871826, 1.0000989437103271, 0.9709858894348145]
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
B: method4() failed: 0.871082572438 != 0.871083 or 1_1_XOYRFZNIJU != 5_1_XOYRFZNIJU
method4 [0.3202958106994629, 0.3348369598388672, 0.33800482749938965]
method5 [0.11518096923828125, 0.10490703582763672, 0.09626197814941406]
I think you need sort_values
first, then groupby
+ cumcount
for counter, then add 1
and convert to str
:我认为您首先需要sort_values
,然后是groupby
+ cumcount
计数器,然后加1
并转换为str
:
df.sort_values(['A', 'B', 'C'], inplace=True)
df['E'] = df.groupby(level=['A','B']).cumcount().add(1).astype(str) + '_' + df['E']
print(df)
A B C D E
A B
MPVAEEHJTV 0_MPVAEEHJTV MPVAEEHJTV 0_MPVAEEHJTV 0.264556 3 1_0_MPVAEEHJTV
0_MPVAEEHJTV MPVAEEHJTV 0_MPVAEEHJTV 0.414662 2 2_0_MPVAEEHJTV
0_MPVAEEHJTV MPVAEEHJTV 0_MPVAEEHJTV 0.521848 1 3_0_MPVAEEHJTV
0_MPVAEEHJTV MPVAEEHJTV 0_MPVAEEHJTV 0.774234 4 4_0_MPVAEEHJTV
0_MPVAEEHJTV MPVAEEHJTV 0_MPVAEEHJTV 0.944669 0 5_0_MPVAEEHJTV
1_MPVAEEHJTV MPVAEEHJTV 1_MPVAEEHJTV 0.018790 2 1_1_MPVAEEHJTV
1_MPVAEEHJTV MPVAEEHJTV 1_MPVAEEHJTV 0.456150 0 2_1_MPVAEEHJTV
1_MPVAEEHJTV MPVAEEHJTV 1_MPVAEEHJTV 0.568434 1 3_1_MPVAEEHJTV
1_MPVAEEHJTV MPVAEEHJTV 1_MPVAEEHJTV 0.612096 4 4_1_MPVAEEHJTV
1_MPVAEEHJTV MPVAEEHJTV 1_MPVAEEHJTV 0.617635 3 5_1_MPVAEEHJTV
RTTTOHABJZ 0_RTTTOHABJZ RTTTOHABJZ 0_RTTTOHABJZ 0.096098 2 1_0_RTTTOHABJZ
0_RTTTOHABJZ RTTTOHABJZ 0_RTTTOHABJZ 0.097101 0 2_0_RTTTOHABJZ
0_RTTTOHABJZ RTTTOHABJZ 0_RTTTOHABJZ 0.468651 4 3_0_RTTTOHABJZ
0_RTTTOHABJZ RTTTOHABJZ 0_RTTTOHABJZ 0.837945 1 4_0_RTTTOHABJZ
0_RTTTOHABJZ RTTTOHABJZ 0_RTTTOHABJZ 0.976459 3 5_0_RTTTOHABJZ
1_RTTTOHABJZ RTTTOHABJZ 1_RTTTOHABJZ 0.039188 3 1_1_RTTTOHABJZ
...
...
EDIT:编辑:
It seems like bug if dont define levels
by columns only if column names are same as index names:如果只有在列名与索引名相同时才按列定义levels
,这似乎是错误:
df['E'] = df.groupby(['A','B']).cumcount().add(1).astype(str) + '_' + df['E']
FutureWarning: 'A' is both a column name and an index level. FutureWarning:“A”既是列名又是索引级别。
Defaulting to column but this will raise an ambiguity error in a future version df['E'] = df.groupby(column=['A','B']).cumcount().add(1).astype(str) + '_' + df['E']默认为 column 但这会在未来版本中引发歧义错误 df['E'] = df.groupby(column=['A','B']).cumcount().add(1).astype(str ) + '_' + df['E']FutureWarning: 'B' is both a column name and an index level. FutureWarning: 'B' 既是列名又是索引级别。 Defaulting to column but this will raise an ambiguity error in a future version df['E'] = df.groupby(column=['A','B']).cumcount().add(1).astype(str) + '_' + df['E']默认为 column 但这会在未来版本中引发歧义错误 df['E'] = df.groupby(column=['A','B']).cumcount().add(1).astype(str ) + '_' + df['E']
One possible solution is rename_axis
if necessary groupby by columns names (or rename columns which same names as index names):一种可能的解决方案是rename_axis
必要时按列名分组(或重命名与索引名称相同的列):
df.sort_values(['A', 'B', 'C'], inplace=True)
df = df.rename_axis(('A_lev','B_lev'))
df['E'] = df.groupby(['A','B']).cumcount().add(1).astype(str) + '_' + df['E']
print(df)
A B C D E
A_lev B_lev
MPVAEEHJTV 0_MPVAEEHJTV MPVAEEHJTV 0_MPVAEEHJTV 0.264556 3 1_0_MPVAEEHJTV
0_MPVAEEHJTV MPVAEEHJTV 0_MPVAEEHJTV 0.414662 2 2_0_MPVAEEHJTV
0_MPVAEEHJTV MPVAEEHJTV 0_MPVAEEHJTV 0.521848 1 3_0_MPVAEEHJTV
0_MPVAEEHJTV MPVAEEHJTV 0_MPVAEEHJTV 0.774234 4 4_0_MPVAEEHJTV
0_MPVAEEHJTV MPVAEEHJTV 0_MPVAEEHJTV 0.944669 0 5_0_MPVAEEHJTV
1_MPVAEEHJTV MPVAEEHJTV 1_MPVAEEHJTV 0.018790 2 1_1_MPVAEEHJTV
1_MPVAEEHJTV MPVAEEHJTV 1_MPVAEEHJTV 0.456150 0 2_1_MPVAEEHJTV
1_MPVAEEHJTV MPVAEEHJTV 1_MPVAEEHJTV 0.568434 1 3_1_MPVAEEHJTV
1_MPVAEEHJTV MPVAEEHJTV 1_MPVAEEHJTV 0.612096 4 4_1_MPVAEEHJTV
1_MPVAEEHJTV MPVAEEHJTV 1_MPVAEEHJTV 0.617635 3 5_1_MPVAEEHJTV
RTTTOHABJZ 0_RTTTOHABJZ RTTTOHABJZ 0_RTTTOHABJZ 0.096098 2 1_0_RTTTOHABJZ
0_RTTTOHABJZ RTTTOHABJZ 0_RTTTOHABJZ 0.097101 0 2_0_RTTTOHABJZ
0_RTTTOHABJZ RTTTOHABJZ 0_RTTTOHABJZ 0.468651 4 3_0_RTTTOHABJZ
0_RTTTOHABJZ RTTTOHABJZ 0_RTTTOHABJZ 0.837945 1 4_0_RTTTOHABJZ
0_RTTTOHABJZ RTTTOHABJZ 0_RTTTOHABJZ 0.976459 3 5_0_RTTTOHABJZ
1_RTTTOHABJZ RTTTOHABJZ 1_RTTTOHABJZ 0.039188 3 1_1_RTTTOHABJZ
1_RTTTOHABJZ RTTTOHABJZ 1_RTTTOHABJZ 0.282807 4 2_1_RTTTOHABJZ
1_RTTTOHABJZ RTTTOHABJZ 1_RTTTOHABJZ 0.604846 1 3_1_RTTTOHABJZ
1_RTTTOHABJZ RTTTOHABJZ 1_RTTTOHABJZ 0.739264 2 4_1_RTTTOHABJZ
1_RTTTOHABJZ RTTTOHABJZ 1_RTTTOHABJZ 0.976761 0 5_1_RTTTOHABJZ
...
...
Sort your dataframe first, then use the .groupby()
method with a .cumcount()
:首先对 dataframe 进行排序,然后将.groupby()
方法与.cumcount()
一起使用:
df.sort_values(['A','B','C'], inplace = True)
df['D'] = df.groupby(['A','B']).cumcount()
If you want to sort it the other way, supply an ascending
parameter to .sort_values()
: ascending = [True,True,False]
.如果您想以另一种方式对其进行排序,请为.sort_values()
提供ascending
参数: ascending = [True,True,False]
。 You also actually don't need to sort by A and B if you want to retain the original ordering of the data (they will still be 'ranked' correctly).如果您想保留数据的原始顺序,您实际上也不需要按 A 和 B 排序(它们仍将被正确“排序”)。
A general remark: if you find yourself using a for loop in pandas to iterate over data, chances are that you can speed up your code substantially by making use of pandas' vectorization.一般性评论:如果您发现自己在 pandas 中使用 for 循环来迭代数据,那么您很可能可以通过使用 pandas 的矢量化来显着加快代码速度。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.