I have a pandas DataFrame as below:
predictions.head()
Out[22]:
A B C D E G H L N \
0 0.718363 0.5 0.403466 0.5 0.5 0.458989 0.5 0.850190 0.620878
1 0.677776 0.5 0.366128 0.5 0.5 0.042405 0.5 0.894200 0.510644
2 0.682019 0.5 0.074347 0.5 0.5 0.562217 0.5 0.417786 0.539949
3 0.482981 0.5 0.065436 0.5 0.5 0.112383 0.5 0.743659 0.604382
4 0.700207 0.5 0.515825 0.5 0.5 0.078089 0.5 0.437839 0.249892
P R S U V LABEL
0 0.182169 0.483631 0.432915 0.328495 0.5 A
1 0.015789 0.523462 0.547838 0.691239 0.5 L
2 0.799223 0.603212 0.620806 0.335204 0.5 G
3 0.246766 0.399070 0.341081 0.229407 0.5 P
4 0.064734 0.822834 0.769277 0.512239 0.5 U
Each row contains the predicted probabilities of the different classes (columns). The last column is the label (the correct class).
I would like to evaluate the performance of the classifier allowing 2 errors: if the correct label is among the 3 highest probabilities, I consider the prediction correct. Is there a smart way to do this in scikit-learn?
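For reference, newer scikit-learn releases (0.24+) ship sklearn.metrics.top_k_accuracy_score, which implements exactly this "correct if the label is among the k highest scores" metric. A minimal sketch on a made-up two-row frame (column names are illustrative, not from the data above):

```python
import pandas as pd
from sklearn.metrics import top_k_accuracy_score

# Toy frame in the same shape as `predictions`: one score column
# per class plus the true label.
df = pd.DataFrame({'a': [1, 3], 'b': [2, 4], 'c': [3, 5],
                   'd': [4, 6], 'e': [5, 7], 'LABEL': ['c', 'd']})

probs = df.drop(columns='LABEL')
# `labels` maps each score column to its class name.
score = top_k_accuracy_score(df['LABEL'], probs.to_numpy(),
                             k=3, labels=probs.columns.to_numpy())
print(score)  # 1.0 here: both labels sit in their row's top 3
```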
Try this approach:
In [57]: x = df.drop('LABEL', axis=1).T.apply(lambda x: x.nlargest(3).index).T
In [58]: x
Out[58]:
0 1 2
0 L A N
1 L U A
2 P A S
3 L N B
4 R S A
In [59]: x.eq(df.LABEL, axis=0).any(axis=1)
Out[59]:
0 True
1 True
2 False
3 False
4 False
dtype: bool
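To go from this boolean Series to a single top-3 accuracy number, take its mean (booleans average as 0/1). A small sketch using the values shown in Out[59]:

```python
import pandas as pd

# The membership test above yields a boolean Series like this one.
hit = pd.Series([True, True, False, False, False])

# Top-3 accuracy is the fraction of True values.
print(hit.mean())  # 0.4
```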
A similar solution, which uses one transpose less:
In [66]: x = df.drop('LABEL', axis=1).T.apply(lambda x: x.nlargest(3).index)
In [67]: x
Out[67]:
0 1 2 3 4
0 L L P L R
1 A U A N S
2 N A S B A
In [68]: x.eq(df.LABEL).any()
Out[68]:
0 True
1 True
2 False
3 False
4 False
dtype: bool
Source DF:
In [70]: df
Out[70]:
A B C D E G H L N P R S U V LABEL
0 0.718363 0.5 0.403466 0.5 0.5 0.458989 0.5 0.850190 0.620878 0.182169 0.483631 0.432915 0.328495 0.5 A
1 0.677776 0.5 0.366128 0.5 0.5 0.042405 0.5 0.894200 0.510644 0.015789 0.523462 0.547838 0.691239 0.5 L
2 0.682019 0.5 0.074347 0.5 0.5 0.562217 0.5 0.417786 0.539949 0.799223 0.603212 0.620806 0.335204 0.5 G
3 0.482981 0.5 0.065436 0.5 0.5 0.112383 0.5 0.743659 0.604382 0.246766 0.399070 0.341081 0.229407 0.5 P
4 0.700207 0.5 0.515825 0.5 0.5 0.078089 0.5 0.437839 0.249892 0.064734 0.822834 0.769277 0.512239 0.5 U
UPDATE: trying to reproduce the error (from comments):
In [81]: df
Out[81]:
a b c d e LABEL
0 1 2 3 4 5 c
1 3 4 5 6 7 d
In [82]: x = df.drop('LABEL', axis=1).T.apply(lambda x: x.nlargest(3).index)
In [83]: x
Out[83]:
0 1
0 e e
1 d d
2 c c
In [84]: x.eq(df.LABEL).any()
Out[84]:
0 True
1 True
dtype: bool
PS I'm using Pandas 0.23.0
If performance is important, use numpy.argsort, dropping the last column with iloc:
print (np.argsort(-df.iloc[:, :-1].values, axis=1)[:,:3])
[[ 7 0 8]
[ 7 12 0]
[ 9 0 11]
[ 7 8 1]
[10 11 0]]
v = df.columns[np.argsort(-df.iloc[:, :-1].values, axis=1)[:,:3]]
print (v)
Index([['L', 'A', 'N'], ['L', 'U', 'A'], ['P', 'A', 'S'], ['L', 'N', 'B'],
['R', 'S', 'A']],
dtype='object')
a = pd.DataFrame(v).eq(df['LABEL'], axis=0).any(axis=1)
print (a)
0 True
1 True
2 False
3 False
4 False
dtype: bool
Thanks @Maxu for another similar solution with numpy.argpartition:
v = df.columns[np.argpartition(-df.iloc[:, :-1].values, 3, axis=1)[:,:3]]
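One caveat worth noting (my addition, not from the thread): np.argpartition only guarantees that the first 3 positions hold the 3 largest values, in arbitrary internal order, which is fine here because we only test membership. Also, newer pandas versions no longer allow indexing df.columns with a 2-D array, so go through .to_numpy() first. A sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 3], 'b': [2, 4], 'c': [3, 5],
                   'd': [4, 6], 'e': [0, 0], 'LABEL': ['c', 'e']})

vals = df.iloc[:, :-1].to_numpy()
cols = df.columns[:-1].to_numpy()

# Positions 0..2 hold the column indices of the 3 largest values per row
# (unsorted among themselves).
idx = np.argpartition(-vals, 3, axis=1)[:, :3]
hit = pd.DataFrame(cols[idx]).eq(df['LABEL'], axis=0).any(axis=1)
print(hit.tolist())  # [True, False]
```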
Sample data:
df = pd.DataFrame({'A': [0.718363, 0.677776, 0.6820189999999999, 0.48298100000000005, 0.700207], 'B': [0.5, 0.5, 0.5, 0.5, 0.5], 'C': [0.403466, 0.366128, 0.074347, 0.06543600000000001, 0.515825], 'D': [0.5, 0.5, 0.5, 0.5, 0.5], 'E': [0.5, 0.5, 0.5, 0.5, 0.5], 'G': [0.45898900000000004, 0.042405, 0.562217, 0.112383, 0.07808899999999999], 'H': [0.5, 0.5, 0.5, 0.5, 0.5], 'L': [0.85019, 0.8942, 0.417786, 0.7436590000000001, 0.43783900000000003], 'N': [0.6208779999999999, 0.510644, 0.539949, 0.604382, 0.249892], 'P': [0.182169, 0.015788999999999997, 0.7992229999999999, 0.24676599999999999, 0.064734], 'R': [0.48363100000000003, 0.523462, 0.603212, 0.39907, 0.8228340000000001], 'S': [0.43291499999999994, 0.547838, 0.6208060000000001, 0.34108099999999997, 0.769277], 'U': [0.328495, 0.691239, 0.335204, 0.22940700000000003, 0.512239], 'V': [0.5, 0.5, 0.5, 0.5, 0.5], 'LABEL': ['A', 'L', 'G', 'P', 'U']})
print (df)
A B C D E G H L N \
0 0.718363 0.5 0.403466 0.5 0.5 0.458989 0.5 0.850190 0.620878
1 0.677776 0.5 0.366128 0.5 0.5 0.042405 0.5 0.894200 0.510644
2 0.682019 0.5 0.074347 0.5 0.5 0.562217 0.5 0.417786 0.539949
3 0.482981 0.5 0.065436 0.5 0.5 0.112383 0.5 0.743659 0.604382
4 0.700207 0.5 0.515825 0.5 0.5 0.078089 0.5 0.437839 0.249892
P R S U V LABEL
0 0.182169 0.483631 0.432915 0.328495 0.5 A
1 0.015789 0.523462 0.547838 0.691239 0.5 L
2 0.799223 0.603212 0.620806 0.335204 0.5 G
3 0.246766 0.399070 0.341081 0.229407 0.5 P
4 0.064734 0.822834 0.769277 0.512239 0.5 U
I can't think of a solution in sklearn, so here's one in pandas.
# Data
predictions
Out[]:
A B C D E G H L N P R S U V LABEL
0 0.718363 0.5 0.403466 0.5 0.5 0.458989 0.5 0.850190 0.620878 0.182169 0.483631 0.432915 0.328495 0.5 A
1 0.677776 0.5 0.366128 0.5 0.5 0.042405 0.5 0.894200 0.510644 0.015789 0.523462 0.547838 0.691239 0.5 L
2 0.682019 0.5 0.074347 0.5 0.5 0.562217 0.5 0.417786 0.539949 0.799223 0.603212 0.620806 0.335204 0.5 G
3 0.482981 0.5 0.065436 0.5 0.5 0.112383 0.5 0.743659 0.604382 0.246766 0.399070 0.341081 0.229407 0.5 P
4 0.700207 0.5 0.515825 0.5 0.5 0.078089 0.5 0.437839 0.249892 0.064734 0.822834 0.769277 0.512239 0.5 U
# Check if the label is in the top 3 (one line solution)
predictions.apply(lambda row: row['LABEL'] in list(row.drop('LABEL').sort_values().tail(3).index), axis=1)
Out[]:
0 True
1 True
2 False
3 False
4 False
Here is what is happening:
# List the top 3 results:
predictions.apply(lambda row: list(row.drop('LABEL').sort_values().tail(3).index), axis=1)
Out[]:
0 [N, A, L]
1 [A, U, L]
2 [S, A, P]
3 [V, N, L]
4 [A, S, R]
# Then check if the 'LABEL' is inside this list
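Both steps combined on a tiny made-up frame (illustrative column names, not the data above):

```python
import pandas as pd

predictions = pd.DataFrame({'a': [1, 3], 'b': [2, 4], 'c': [3, 5],
                            'd': [4, 6], 'e': [0, 0], 'LABEL': ['c', 'e']})

# Per row: drop the label, sort the scores, keep the top 3
# column names, and test membership of the true label.
hit = predictions.apply(
    lambda row: row['LABEL'] in row.drop('LABEL').sort_values().tail(3).index,
    axis=1)
print(hit.tolist())  # [True, False]
```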
You could also ask this question on Cross Validated, as they use sklearn extensively.