简体   繁体   中英

Pandas pivot_table: filter on aggregate function

I am trying to pass a criteria to the aggregate function to pandas pivot_table and I am not able to figure out how to pass the criteria to the aggfunc. I have a data table which is converted to pandas df.

The input table data:

col1 col2 col3 col4 col5 col6 col7
1 test1 t1 Dummy1 result1 10 102.2
2 test1 t1 Dummy2 result2 20 101.2
3 test1 t1 Dummy3 result3 30 102.3
4 test1 t1 Dummy4 result4 40 101.4
5 test2 t1 Dummy1 result1 10 100
6 test2 t1 Dummy2 result2 20 103
7 test2 t1 Dummy3 result3 30 104
8 test2 t1 Dummy4 result4 40 105
9 test3 t1 Dummy1 result1 10 102
10 test3 t1 Dummy2 result2 20 87
11 test3 t1 Dummy3 result3 30 107
12 test3 t1 Dummy5 result4 50 110.2
13 test4 t1 Dummy2 result2 20 120
14 test5 t1 Dummy6 result1 100 88
15 test1 t1 Dummy1 result2 10 106.2
16 test1 t1 Dummy1 result6 10 101.1

I want to get the maximum on col7 data, but only when the maximum is greater than 100. If any of the col7 data is greater than the user defined criteria, then all the other columns data needs to be populated irrespective if the data met the criteria or not.

Tried the following:

columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7']

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
    'col2': ['test1', 'test1', 'test1', 'test1', 'test2', 'test2', 'test2',
             'test2', 'test3', 'test3', 'test3', 'test3', 'test4', 'test5',
             'test1', 'test1'],
    'col3': ['t1', 't1', 't1', 't1', 't1', 't1', 't1', 't1', 't1', 't1', 't1',
             't1', 't1', 't1', 't1', 't1'],
    'col4': ['Dummy1', 'Dummy2', 'Dummy3', 'Dummy4', 'Dummy1', 'Dummy2',
             'Dummy3', 'Dummy4', 'Dummy1', 'Dummy2', 'Dummy3', 'Dummy5',
             'Dummy2', 'Dummy6', 'Dummy1', 'Dummy1'],
    'col5': ['result1', 'result2', 'result3', 'result4', 'result1', 'result2',
             'result3', 'result4', 'result1', 'result2', 'result3', 'result4',
             'result2', 'result1', 'result2', 'result6'],
    'col6': [10, 20, 30, 40, 10, 20, 30, 40, 10, 20, 30, 50, 20, 100, 10, 10],
    'col7': [100.2, 101.2, 102.3, 101.4, 100.0, 103.0, 104.0, 105.0, 102.0,
             87.0, 107.0, 110.2, 120.0, 88.0, 106.2, 101.1]
})

res=df.pivot_table(values = 'col7', index = ['col4', 'col5', 'col6'], columns = ['col2'], fill_value = '', aggfunc = 'max' >= 100)
TypeError: '>=' not supported between instances of 'str' and 'int'

Output should look like:

Max pivoted output without col5:

col4 col6 test1 test2 test3 test4 test5
Dummy1 10 106.2 100 102 N/A N/A
Dummy2 20 101.2 103 87 120 N/A
Dummy3 30 102.3 104 107 N/A N/A
Dummy4 40 101.4 105 N/A N/A N/A
Dummy5 50 N/A N/A 110.2 N/A N/A

Max pivoted output including col5:

col4 col5 col6 test1 test2 test3 test4 test5
Dummy1 result2 10 106.2 N/A N/A N/A N/A
Dummy1 result1 10 102.2 100 102 N/A N/A
Dummy2 result2 20 101.2 103 87 120 N/A
Dummy3 result3 30 102.3 104 107 N/A N/A
Dummy4 result4 40 101.4 105 N/A N/A N/A
Dummy5 result4 50 N/A N/A 110.2 N/A N/A

Any guidance is much appreciated.

Thanks

You can't compare the word 'max' to 100 via >= ( aggfunc = 'max' >= 100 ):

I recommend not setting the fill value to a string, masking the DataFrame, to get rid of undesired rows, then replace with empty string via fillna :

columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7']
res = df.pivot_table(values='col7', index=['col4', 'col5', 'col6'],
                     columns=['col2'], aggfunc='max')
col2                 test1  test2  test3  test4  test5
col4   col5    col6                                   
Dummy1 result1 10    102.2  100.0  102.0    NaN    NaN
       result2 10    106.2    NaN    NaN    NaN    NaN
       result6 10    101.1    NaN    NaN    NaN    NaN
Dummy2 result2 20    101.2  103.0   87.0  120.0    NaN
Dummy3 result3 30    102.3  104.0  107.0    NaN    NaN
Dummy4 result4 40    101.4  105.0    NaN    NaN    NaN
Dummy5 result4 50      NaN    NaN  110.2    NaN    NaN
Dummy6 result1 100     NaN    NaN    NaN    NaN   88.0

Mask for any values where any res >= 100 and fillna :

res = res[(res >= 100).any(1)].fillna('')
col2                 test1  test2  test3  test4 test5
col4   col5    col6                                  
Dummy1 result1 10    102.2  100.0  102.0             
       result2 10    106.2                           
       result6 10    101.1                           
Dummy2 result2 20    101.2  103.0   87.0  120.0      
Dummy3 result3 30    102.3  104.0  107.0             
Dummy4 result4 40    101.4  105.0                    
Dummy5 result4 50                  110.2                       

Optional reset_index to clear the MultiIndex and rename_axis to clear the axis name:

res[(res >= 100).any(1)].fillna('').reset_index().rename_axis(None, axis=1)
     col4     col5  col6  test1  test2  test3  test4 test5
0  Dummy1  result1    10  102.2  100.0  102.0             
1  Dummy1  result2    10  106.2                           
2  Dummy1  result6    10  101.1                           
3  Dummy2  result2    20  101.2  103.0   87.0  120.0      
4  Dummy3  result3    30  102.3  104.0  107.0             
5  Dummy4  result4    40  101.4  105.0                    
6  Dummy5  result4    50                110.2             

Complete Working Example:

import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
    'col2': ['test1', 'test1', 'test1', 'test1', 'test2', 'test2', 'test2',
             'test2', 'test3', 'test3', 'test3', 'test3', 'test4', 'test5',
             'test1', 'test1'],
    'col3': ['t1', 't1', 't1', 't1', 't1', 't1', 't1', 't1', 't1', 't1', 't1',
             't1', 't1', 't1', 't1', 't1'],
    'col4': ['Dummy1', 'Dummy2', 'Dummy3', 'Dummy4', 'Dummy1', 'Dummy2',
             'Dummy3', 'Dummy4', 'Dummy1', 'Dummy2', 'Dummy3', 'Dummy5',
             'Dummy2', 'Dummy6', 'Dummy1', 'Dummy1'],
    'col5': ['result1', 'result2', 'result3', 'result4', 'result1', 'result2',
             'result3', 'result4', 'result1', 'result2', 'result3', 'result4',
             'result2', 'result1', 'result2', 'result6'],
    'col6': [10, 20, 30, 40, 10, 20, 30, 40, 10, 20, 30, 50, 20, 100, 10, 10],
    'col7': [102.2, 101.2, 102.3, 101.4, 100.0, 103.0, 104.0, 105.0, 102.0,
             87.0, 107.0, 110.2, 120.0, 88.0, 106.2, 101.1]
})

columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7']
res = df.pivot_table(values='col7', index=['col4', 'col5', 'col6'],
                     columns=['col2'], aggfunc='max')
res = (
    res[(res >= 100).any(1)].fillna('').reset_index().rename_axis(None, axis=1)
)
print(res)

To get the value without col5 remove it from the index of the pivot_table :

columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7']
res = df.pivot_table(values='col7', index=['col4', 'col6'],
                     columns=['col2'], aggfunc='max')
res = (
    res[(res >= 100).any(1)].fillna('').reset_index().rename_axis(None, axis=1)
)
     col4  col6  test1  test2  test3  test4 test5
0  Dummy1    10  106.2  100.0  102.0             
1  Dummy2    20  101.2  103.0   87.0  120.0      
2  Dummy3    30  102.3  104.0  107.0             
3  Dummy4    40  101.4  105.0                    
4  Dummy5    50                110.2             

Or you can try:

res = df.assign(col7 = df.col7.where(df.col7 > 100)).pivot_table(values='col7', index=['col4', 'col5', 'col6'],
                     columns=['col2'], aggfunc='max', fill_value= '')

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM