Pandas DataFrame column corrupted while calculating an additional column

I have a dataset with the following columns and rows

Scored Probabilities for Class "1"  Scored Probabilities for Class "2"  Scored Probabilities for Class "3"  Scored Labels
0.258471                            0.009299                            0.005433                            1
0.154108                            0.009577                            0.527308                            3
0.001949                            0.634572                            0.000953                            2

(Actually, there are 17 "Classes", but I've simplified to 3 for this post)

I'd like to add an extra column called "Scored Label Probability" that holds the maximum of the first three columns (more precisely, the maximum across all columns named "Scored Probabilities for Class "X""). The result should look like this:

                                        Scored Label Probability (new)
0.258471    0.009299    0.005433    1   0.258471
0.154108    0.009577    0.527308    3   0.527308
0.001949    0.634572    0.000953    2   0.634572

Here is my code (below). Unfortunately, the "Scored Labels" column (the 4th column in the example data) is getting corrupted: its values are replaced by different integers. Any suggestions on how to fix it? Thanks

# The script MUST contain a function named azureml_main
# which is the entry point for this module.

import pandas as pd
import numpy as np

# The entry point function can contain up to two input arguments:
#   Param<dataframe1>: a pandas.DataFrame
#   Param<dataframe2>: a pandas.DataFrame
def azureml_main(df = None, df2 = None):

    # First add the empty column
    df['Scored Label Probability'] = 0.0

    for rowindex, row in df.iterrows():
        max_probability =0.0
        column_value = 0.0
        column_name = ''
        for column_name, column_value in row.iteritems():
            if column_name.startswith('Scored Probabilities for Class'):
                if column_value>max_probability:
                    max_probability = column_value

        # print (max_probability,max_prob_column_name)
        df.set_value(rowindex,'Scored Label Probability',max_probability)

    # Return value must be of a sequence of pandas.DataFrame
    return df

You can use the DataFrame.max method along axis=1, which gives the highest value in each row across all the columns whose names contain the matching string (selected using the DataFrame.filter method):

df.filter(like='Scored Probabilities for Class').max(axis=1)

0    0.258471
1    0.527308
2    0.634572
dtype: float64
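
To connect this back to the question, here is a minimal sketch of how the one-liner could replace the row loop inside azureml_main. The PREFIX constant and the sample frame are illustrative names built from the data in the question, and the function returns df exactly as the question's code does; whether the Azure ML module needs a different return shape is not checked here.

```python
import pandas as pd

# Hypothetical constant: the substring shared by all probability columns.
PREFIX = 'Scored Probabilities for Class'

def azureml_main(df=None, df2=None):
    # Select only the columns whose names contain the prefix.
    prob_cols = df.filter(like=PREFIX)
    # Row-wise maximum; assigning the resulting Series creates the new
    # column in one step, without writing into any existing column.
    df['Scored Label Probability'] = prob_cols.max(axis=1)
    return df

# Sample data copied from the question (3 of the 17 classes).
sample = pd.DataFrame({
    'Scored Probabilities for Class "1"': [0.258471, 0.154108, 0.001949],
    'Scored Probabilities for Class "2"': [0.009299, 0.009577, 0.634572],
    'Scored Probabilities for Class "3"': [0.005433, 0.527308, 0.000953],
    'Scored Labels': [1, 3, 2],
})

result = azureml_main(sample)
print(result[['Scored Labels', 'Scored Label Probability']])
```

Because the new column is computed from a filtered copy and assigned as a whole, the "Scored Labels" column is never modified cell by cell, so it should keep its original values and integer dtype.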

To do the same in R, you can use the pmax function, which returns the parallel maxima of its input vectors; here these are the columns that start with the specified prefix.

Additionally, using the dplyr package, select can subset the columns, with string helpers such as starts_with doing the equivalent of the filter operation above.

library(dplyr)
df$max <- do.call(pmax, select(df, starts_with('Scored Probabilities for Class')))
