My DataFrame represents attributes in each column and yes/no values in each row, if applicable:
import pandas as pd

d_att = {'attribute1': ['yes', 'yes', 'no'],
         'attribute2': ['no', 'yes', 'no'],
         'attribute3': ['no', 'no', 'yes']}
df_att = pd.DataFrame(data=d_att)
df_att
attribute1 attribute2 attribute3
0 yes no no
1 yes yes no
2 no no yes
Now I need to calculate the likelihood of each combination of attributes, e.g. if attribute1 is yes, then the likelihood of attribute2 also being yes is 0.5.
I'm aiming for a DataFrame like this:
attribute1 attribute2 attribute3
attribute1 1.0 0.5 0.0
attribute2 1.0 1.0 0.0
attribute3 0.0 0.0 1.0
So far I started by replacing the yes/no values with integers (1/0):
df_att_int = df_att.replace({'no': 0, 'yes': 1})
df_att_int
attribute1 attribute2 attribute3
0 1 0 0
1 1 1 0
2 0 0 1
Then I defined a method that loops over each column, filters the DataFrame for rows with value 1 in the current column, calculates the sum of each column in the filtered DataFrame, and divides the sums by the number of filtered rows (which equals the sum of the current column):
def combination_likelihood(df):
    df_dict = {}
    for column in df.columns:
        col_sum = df[df[column] == 1].sum()
        divisor = col_sum[column]
        df_dict[column] = col_sum / divisor
    return pd.DataFrame(data=df_dict).T
Applying the method to my df_att_int DataFrame delivers the expected result:
df_att_comb_like = combination_likelihood(df_att_int)
df_att_comb_like
attribute1 attribute2 attribute3
attribute1 1.0 0.5 0.0
attribute2 1.0 1.0 0.0
attribute3 0.0 0.0 1.0
However, if the attribute/column names are not in alphabetical order, the rows will be sorted by label, and the characteristic pattern needed for insightful plots will be lost, for example resulting in the following structure:
attribute2 attribute3 attribute1
attribute1 0.5 0.0 1.0
attribute2 1.0 0.0 1.0
attribute3 0.0 1.0 0.0
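One way to guard against this (a sketch, assuming the desired order is simply the original column order of df_att) is to reindex both axes of the result before plotting:

```python
import pandas as pd

d_att = {'attribute1': ['yes', 'yes', 'no'],
         'attribute2': ['no', 'yes', 'no'],
         'attribute3': ['no', 'no', 'yes']}
df_att = pd.DataFrame(data=d_att)

def combination_likelihood(df):
    df_dict = {}
    for column in df.columns:
        col_sum = df[df[column] == 1].sum()
        df_dict[column] = col_sum / col_sum[column]
    return pd.DataFrame(data=df_dict).T

df_att_int = df_att.replace({'no': 0, 'yes': 1})
result = combination_likelihood(df_att_int)
# Force rows and columns back into the original column order
result = result.reindex(index=df_att.columns, columns=df_att.columns)
```

This keeps the diagonal pattern intact regardless of how pandas happens to order the dictionary keys.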
Ultimately, I want to plot out the result as a heatmap:
import seaborn as sns
sns.heatmap(df_att_comb_like)
Is there an easier, more elegant way to construct the likelihood DataFrame while preserving the same order for column and row labels? Any help would be greatly appreciated!
I put together something nicer:
df_att.eq('yes').astype(int) \
.pipe(lambda d: d.T.dot(d)) \
.pipe(lambda d: d.div(d.max(1), 0))
attribute1 attribute2 attribute3
attribute1 1.0 0.5 0.0
attribute2 1.0 1.0 0.0
attribute3 0.0 0.0 1.0
Make the dataframe an integer mask
d = df_att.eq('yes').astype(int)
d
attribute1 attribute2 attribute3
0 1 0 0
1 1 1 0
2 0 0 1
Dot product with itself
d2 = d.T.dot(d)
d2
attribute1 attribute2 attribute3
attribute1 2 1 0
attribute2 1 1 0
attribute3 0 0 1
Divide each row by the maximum of that row (which is always the diagonal entry, i.e. the count of rows where that attribute is yes):
d2.div(d2.max(axis=1), axis=0)
attribute1 attribute2 attribute3
attribute1 1.0 0.5 0.0
attribute2 1.0 1.0 0.0
attribute3 0.0 0.0 1.0
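Putting the steps above together, a self-contained sketch of this approach looks like:

```python
import pandas as pd

df_att = pd.DataFrame({'attribute1': ['yes', 'yes', 'no'],
                       'attribute2': ['no', 'yes', 'no'],
                       'attribute3': ['no', 'no', 'yes']})

# Boolean mask -> integers
d = df_att.eq('yes').astype(int)
# Dot product with itself gives co-occurrence counts
d2 = d.T.dot(d)
# Normalize each row by its maximum (the diagonal count)
likelihood = d2.div(d2.max(axis=1), axis=0)
```

Because the dot product builds the index from d's own columns, the row and column order of the result matches the original column order automatically.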
This is very similar to the machine learning algorithm called the "perceptron", which updates its weights with each data point. If you get hold of a PDF of Python Machine Learning by Sebastian Raschka, you can see this implementation on page 25; you may want to read about the perceptron rule. You could implement this looping with a lambda function, a for loop, or in many other ways.
"Threshold function" is also a term you might want to look up for your condition, since it is very close to what you are implementing.
for _ in range(self.n_iter):
    errors = 0
    for xi, target in zip(X, y):
        update = self.eta * (target - self.predict(xi))
        self.w_[1:] += update * xi
        self.w_[0] += update
        errors += int(update != 0.0)
    self.errors_.append(errors)
return self
(lines 125 to 133 in the book's code listing)
There is also a notebook link which further explains the steps here: ipyn
In the code I list here, a for loop has been chosen as the implementation. Personally, I would use a lambda function or map().
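To make the excerpt above concrete, here is a minimal self-contained perceptron sketch (the names eta, n_iter, w_ and errors_ follow the book's convention, but this is an illustrative sketch, not the book's exact code); the predict method is the threshold function mentioned above:

```python
import numpy as np

class Perceptron:
    """Minimal perceptron: w <- w + eta * (target - prediction) * x."""
    def __init__(self, eta=0.1, n_iter=10):
        self.eta = eta        # learning rate
        self.n_iter = n_iter  # passes over the training data

    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])  # bias stored in w_[0]
        self.errors_ = []
        for _ in range(self.n_iter):
            errors = 0
            for xi, target in zip(X, y):
                update = self.eta * (target - self.predict(xi))
                self.w_[1:] += update * xi
                self.w_[0] += update
                errors += int(update != 0.0)
            self.errors_.append(errors)
        return self

    def predict(self, xi):
        # Threshold (step) function on the net input
        return np.where(np.dot(xi, self.w_[1:]) + self.w_[0] >= 0.0, 1, -1)

# Example: learn logical AND with classes encoded as {-1, 1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])
clf = Perceptron(eta=0.1, n_iter=10).fit(X, y)
```

Since AND is linearly separable, the error count in clf.errors_ drops to zero within the ten passes.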