Adding a new pandas DataFrame column based on row-wise operations

I have a Dataframe like this:

    Interesting  genre_1      probabilities
1   no           Empty        0.251306
2   yes          Empty        0.042043
3   no           Alternative  5.871099
4   yes          Alternative  5.723896
5   no           Blues        0.027028
6   yes          Blues        0.120248
7   no           Children's   0.207213
8   yes          Children's   0.426679
9   no           Classical    0.306316
10  yes          Classical    1.044135

I would like to compute the Gini index within each category (genre_1), using the two rows that the Interesting column distinguishes. After that, I would like to add that value as a new pandas column.

This is the function to get the Gini index:

#Gini Function
#a and b are the quantities of each class
def gini(a,b):
    a1 = (a/(a+b))**2
    b1 = (b/(a+b))**2
    return 1 - (a1 + b1) 

EDIT: Sorry, I had an error in my final desired DataFrame. Whether a row is interesting or not matters when choosing prob(A) and prob(B), but the Gini score will be the same either way, because it measures how much impurity we get when classifying a song as interesting or not. So if the probabilities are around 50/50, the Gini score reaches its maximum (0.5), because either choice is equally likely to be wrong.

So for the first two rows, the Gini index will be:

a=no; b=Empty -> gini(0.251306, 0.042043)= 0.245559831601612
a=yes; b=Empty -> gini(0.042043, 0.251306)= 0.245559831601612
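
As a quick check of the two claims above (the score does not depend on argument order, and it peaks at 0.5 for an even split), here is a minimal sketch that reuses the gini function from the question. The variable names no_empty and yes_empty are only illustrative.

```python
def gini(a, b):
    # a and b are the quantities of each class
    total = a + b
    return 1 - ((a / total) ** 2 + (b / total) ** 2)


# Probabilities of the two "Empty" rows from the table above
no_empty, yes_empty = 0.251306, 0.042043

score_no_first = gini(no_empty, yes_empty)
score_yes_first = gini(yes_empty, no_empty)
even_split = gini(1.0, 1.0)

print(score_no_first, score_yes_first)  # both about 0.24556
print(even_split)                       # 0.5, the maximum for two classes
```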

Then I would like to get something like:

    Interesting  genre_1      percentages  GINI INDEX
1   no           Empty        0.251306     0.245559831601612
2   yes          Empty        0.042043     0.245559831601612
3   no           Alternative  5.871099     0.4999194135183881
4   yes          Alternative  5.723896     0.4999194135183881
5   no           Blues        0.027028     ..
6   yes          Blues        0.120248
7   no           Children's   0.207213
8   yes          Children's   0.426679
9   no           Classical    0.306316     ..
10  yes          Classical    1.044135     ..

I am not sure how the Interesting column plays into all of this, but I highly recommend that you make the new column by using numpy.where(). The syntax would be something like:

import numpy as np
df['GINI INDEX'] = np.where(__condition__,__what to do if true__,__what to do if false__)
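
Since the question's edit says both rows of a genre should get the same score, another possible route is to group by genre_1 and broadcast one Gini value per group. The following is only a sketch under that reading: the inline DataFrame stands in for the first four rows of the question's data, and gini_of_group is a hypothetical helper, not something from the original post.

```python
import pandas as pd

# Stand-in for the first four rows of the question's DataFrame
df = pd.DataFrame({
    "Interesting": ["no", "yes", "no", "yes"],
    "genre_1": ["Empty", "Empty", "Alternative", "Alternative"],
    "probabilities": [0.251306, 0.042043, 5.871099, 5.723896],
})


def gini_of_group(probs):
    # Normalise the group's values to shares, then apply 1 - sum(share**2)
    shares = probs / probs.sum()
    return 1 - (shares ** 2).sum()


# transform broadcasts the per-group scalar back onto every row of the group
df["GINI INDEX"] = df.groupby("genre_1")["probabilities"].transform(gini_of_group)

print(df)
```

With two rows per genre this gives the same number as gini(a, b) from the question, and it does not depend on the row order within a genre.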

OK, I think I know what you mean. The code directly below ignores whether the Interesting value is 'yes' or 'no'. What you want, though, is to calculate the Gini coefficient in one of two ways for each row, depending on that row's Interesting value. If Interesting is 'no', the result is 0.5, because a == b. If Interesting is 'yes', you need to use a = probability[i] and b = probability[i+1]. So skip this section and go to the updated code further down.

import pandas as pd


# sep=r'\s+' splits on any whitespace; delim_whitespace is deprecated since pandas 2.2
df = pd.read_csv('df.txt', sep=r'\s+')

probs = df['probabilities']


def ROLLING_GINI(probabilities):

    a1 = (probabilities[0]/(probabilities[0]+probabilities[0]))**2
    b1 = (probabilities[0]/(probabilities[0]+probabilities[0]))**2
    res = 1 - (a1 + b1)
    yield res

    for i in range(len(probabilities)-1):
        a1 = (probabilities[i]/(probabilities[i]+probabilities[i+1]))**2
        b1 = (probabilities[i+1]/(probabilities[i]+probabilities[i+1]))**2
        res = 1 - (a1 + b1)
        yield res


df['GINI'] = [val for val in ROLLING_GINI(probs)]

print(df)

This is where the real trouble starts: if I understand your idea correctly, you cannot calculate the last Gini value, because your dataframe won't allow it. The important bit is that the last Interesting value in your dataframe is 'yes', which means I have to use a = probability[i] and b = probability[i+1]. But your dataframe has only 10 rows, so on the last row you would need a probability from a row 11 that does not exist. For your idea to work, the last Interesting value MUST be 'no'; otherwise you will always get an index error.

Here's the code anyways:

import pandas as pd

# sep=r'\s+' splits on any whitespace; delim_whitespace is deprecated since pandas 2.2
df = pd.read_csv('df.txt', sep=r'\s+')


def ROLLING_GINI(dataframe):

    probabilities = dataframe['probabilities']
    how_to_calculate = dataframe['Interesting']

    for i in range(len(dataframe)-1):

        if how_to_calculate[i] == 'yes':
            a1 = (probabilities[i]/(probabilities[i]+probabilities[i+1]))**2
            b1 = (probabilities[i+1]/(probabilities[i]+probabilities[i+1]))**2
            res = 1 - (a1 + b1)
            yield res

        elif how_to_calculate[i] == 'no':
            a1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            b1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            res = 1 - (a1 + b1)
            yield res


GINI = [val for val in ROLLING_GINI(df)]

print('All GINI coefficients: %s'%GINI)
print('Length of all calculatable GINI coefficients: %s'%len(GINI))
print('Number of rows in the dataframe: %s'%len(df))
print('The last Interesting value is: %s'%df.iloc[-1,0])

EDIT NUMBER THREE (Sorry for the late realization):

So it does work if I apply the indexing correctly. The problem was that I was pairing each 'yes' row with the next probability when it should be paired with the previous one. So it's a = probabilities[i-1] and b = probabilities[i].

import pandas as pd

# sep=r'\s+' splits on any whitespace; delim_whitespace is deprecated since pandas 2.2
df = pd.read_csv('df.txt', sep=r'\s+')


def ROLLING_GINI(dataframe):

    probabilities = dataframe['probabilities']
    how_to_calculate = dataframe['Interesting']

    for i in range(len(dataframe)):

        if how_to_calculate[i] == 'yes':
            a1 = (probabilities[i-1]/(probabilities[i-1]+probabilities[i]))**2
            b1 = (probabilities[i]/(probabilities[i-1]+probabilities[i]))**2
            res = 1 - (a1 + b1)
            yield res

        elif how_to_calculate[i] == 'no':
            a1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            b1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            res = 1 - (a1 + b1)
            yield res


GINI = [val for val in ROLLING_GINI(df)]

print('All GINI coefficients: %s'%GINI)
print('Length of all calculatable GINI coefficients: %s'%len(GINI))
print('Number of rows in the dataframe: %s'%len(df))
print('The last Interesting value is: %s'%df.iloc[-1,0])
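
The same pairing (each 'yes' row with the probability of the row before it, and 0.5 for each 'no' row) can probably also be written without an explicit loop, by combining Series.shift with the numpy.where suggestion from the first answer. This is only a sketch: the inline DataFrame stands in for the first rows of df.txt, and the column name GINI is an assumption.

```python
import numpy as np
import pandas as pd

# Stand-in for the first four rows of df.txt
df = pd.DataFrame({
    "Interesting": ["no", "yes", "no", "yes"],
    "genre_1": ["Empty", "Empty", "Alternative", "Alternative"],
    "probabilities": [0.251306, 0.042043, 5.871099, 5.723896],
})

current = df["probabilities"]
previous = current.shift(1)  # probability of the row above; NaN for the first row
total = previous + current

# Gini of each row paired with the row before it
pair_gini = 1 - ((previous / total) ** 2 + (current / total) ** 2)

# 'yes' rows take the paired value, 'no' rows take 0.5 as in the loop version
df["GINI"] = np.where(df["Interesting"] == "yes", pair_gini, 0.5)

print(df)
```

If the very first row were 'yes', it would get NaN here, because it has no previous row to pair with.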
