
Likelihood of combinations of values in pandas.DataFrame columns

My DataFrame has one attribute per column; each row holds yes / no values indicating whether the attribute applies:

import pandas as pd

d_att = { 'attribute1': ['yes', 'yes', 'no'],
          'attribute2': ['no', 'yes', 'no'],
          'attribute3': ['no', 'no', 'yes'] }

df_att = pd.DataFrame(data=d_att)
df_att

    attribute1  attribute2  attribute3
0   yes         no          no
1   yes         yes         no
2   no          no          yes

Now I need to calculate the likelihood of each combination of attributes, e.g. if attribute1 is yes, then the likelihood of attribute2 also being yes is 0.5.
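As a sanity check, that 0.5 for the (attribute1, attribute2) pair can be computed directly with a boolean mask — a minimal sketch reusing the frame defined above:

```python
import pandas as pd

df_att = pd.DataFrame({'attribute1': ['yes', 'yes', 'no'],
                       'attribute2': ['no', 'yes', 'no'],
                       'attribute3': ['no', 'no', 'yes']})

# rows where attribute1 is 'yes'
mask = df_att['attribute1'] == 'yes'

# among those rows, the fraction where attribute2 is also 'yes'
likelihood = (df_att.loc[mask, 'attribute2'] == 'yes').mean()
print(likelihood)  # 0.5
```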

I'm aiming for a DataFrame like this:

             attribute1  attribute2  attribute3
attribute1   1.0         0.5         0.0
attribute2   1.0         1.0         0.0
attribute3   0.0         0.0         1.0

So far, I started by replacing the yes / no values with integers (1 / 0):

df_att_int = df_att.replace({'no': 0, 'yes': 1})
df_att_int 

    attribute1  attribute2  attribute3
0   1           0           0
1   1           1           0
2   0           0           1

Then I defined a method that loops over the columns: for each column it filters the DataFrame to rows with value 1 in that column, sums each column of the filtered DataFrame, and divides those sums by the number of filtered rows (which equals the sum of the current column):

def combination_likelihood(df):
    df_dict = {}

    for column in df.columns:
        # rows where the current attribute is 1, summed per column
        col_sum = df[df[column] == 1].sum()
        # number of filtered rows == the sum of the current column
        divisor = col_sum[column]
        df_dict[column] = col_sum / divisor

    return pd.DataFrame(data=df_dict).T

Applying the method to my df_att_int DataFrame delivers the expected result:

df_att_comb_like = combination_likelihood(df_att_int)
df_att_comb_like

             attribute1  attribute2  attribute3
attribute1   1.0         0.5         0.0
attribute2   1.0         1.0         0.0
attribute3   0.0         0.0         1.0

However, if the attribute/column names are not in alphabetical order, the rows will be sorted by label and the characteristic pattern needed for insightful plots is lost, for example resulting in the following structure:

             attribute2  attribute3  attribute1
attribute1   0.5         0.0         1.0
attribute2   1.0         0.0         1.0
attribute3   0.0         1.0         0.0
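One way to guard against this (older pandas versions could sort the labels when a frame is built from a dict) is to reindex the result against the source frame's own column order. A sketch of the loop method with that fix, using hypothetical out-of-order column names:

```python
import pandas as pd

# columns deliberately not in alphabetical order
df_att_int = pd.DataFrame({'attribute2': [0, 1, 0],
                           'attribute1': [1, 1, 0],
                           'attribute3': [0, 0, 1]})

def combination_likelihood(df):
    df_dict = {}
    for column in df.columns:
        col_sum = df[df[column] == 1].sum()
        df_dict[column] = col_sum / col_sum[column]
    # pin rows and columns to the source column order,
    # regardless of how the dict keys were ordered
    return pd.DataFrame(df_dict).T.reindex(index=df.columns,
                                           columns=df.columns)

res = combination_likelihood(df_att_int)
print(res)
```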

Ultimately, I want to plot out the result as a heatmap:

import seaborn as sns
sns.heatmap(df_att_comb_like)

(image: seaborn heatmap of the likelihood DataFrame)

Is there an easier, more elegant way to construct the likelihood DataFrame while preserving the same order for column and row labels? Any help would be greatly appreciated!

One-liner

While I put together something nicer, here is a one-liner:

df_att.eq('yes').astype(int) \
    .pipe(lambda d: d.T.dot(d)) \
    .pipe(lambda d: d.div(d.max(1), 0))

            attribute1  attribute2  attribute3
attribute1         1.0         0.5         0.0
attribute2         1.0         1.0         0.0
attribute3         0.0         0.0         1.0

Longer

Make the dataframe an integer mask

d = df_att.eq('yes').astype(int)
d

   attribute1  attribute2  attribute3
0           1           0           0
1           1           1           0
2           0           0           1

Dot product with itself

d2 = d.T.dot(d)
d2

            attribute1  attribute2  attribute3
attribute1           2           1           0
attribute2           1           1           0
attribute3           0           0           1
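Each entry (i, j) of d2 counts the rows in which attribute i and attribute j are both 1, so the diagonal holds the per-attribute "yes" counts. A quick sketch verifying one entry against an explicit boolean mask:

```python
import pandas as pd

d = pd.DataFrame({'attribute1': [1, 1, 0],
                  'attribute2': [0, 1, 0],
                  'attribute3': [0, 0, 1]})

d2 = d.T.dot(d)

# entry ('attribute1', 'attribute2') should equal the number of
# rows where both attributes are 1 (here: only row 1)
both = ((d['attribute1'] == 1) & (d['attribute2'] == 1)).sum()
print(both, d2.loc['attribute1', 'attribute2'])  # 1 1
```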

Divide each row by the maximum of that row

d2.div(d2.max(axis=1), axis=0)

            attribute1  attribute2  attribute3
attribute1         1.0         0.5         0.0
attribute2         1.0         1.0         0.0
attribute3         0.0         0.0         1.0
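Since the co-occurrence count for a pair can never exceed either attribute's own count, the row maximum of d2 is always its diagonal entry; dividing by the diagonal is therefore an equivalent, arguably more explicit, normalisation. A sketch comparing the two:

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({'attribute1': [1, 1, 0],
                  'attribute2': [0, 1, 0],
                  'attribute3': [0, 0, 1]})

d2 = d.T.dot(d)

# normalise by row maximum (as above) and by the diagonal counts;
# both divide each row by that attribute's total 'yes' count
via_max = d2.div(d2.max(axis=1), axis=0)
via_diag = d2.div(pd.Series(np.diag(d2), index=d2.index), axis=0)
print(via_max.equals(via_diag))  # True
```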

This is very similar to the machine-learning algorithm called the "perceptron", which corrects its weights with each data point. If you get hold of a PDF of Python Machine Learning by Sebastian Raschka, you can see this implementation on page 25; you may want to read about the perceptron rule. You might implement this looping with a lambda function, a for loop, or in many other ways.

"Threshold function" is another term you may want to look up for your condition, since it is very close to what you are implementing.
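For reference, the threshold (unit-step) function behind the perceptron's decision rule can be sketched as follows (the function name here is mine, not from the book):

```python
import numpy as np

def threshold(net_input):
    # unit step: net input >= 0 maps to class 1, otherwise class -1
    return np.where(net_input >= 0.0, 1, -1)

print(threshold(np.array([-0.3, 0.0, 0.7])))  # [-1  1  1]
```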

[link](https://github.com/PacktPublishing/Python-Machine-Learning-Second-Edition/blob/master/Chapter02/ch02.py)

    # excerpt: the fit loop of the Perceptron class in the linked ch02.py
    for _ in range(self.n_iter):
        errors = 0
        for xi, target in zip(X, y):
            # perceptron rule: scale the prediction error by the learning rate eta
            update = self.eta * (target - self.predict(xi))
            self.w_[1:] += update * xi
            self.w_[0] += update
            errors += int(update != 0.0)
        self.errors_.append(errors)
    return self

lines 125 to 133 of the linked file

There is also a notebook link which explains the steps further: ipyn

In the code listed here, a for loop has been chosen as the implementation. Personally, I would use a lambda function or map() instead.
