
Create a confusion matrix

A note up front: I am building a recommender system that should later suggest articles to the user that they might also like.


I'm in the process of creating a confusion matrix. Unfortunately I can't get it to work and I'm getting an error. I have attached an example below; unfortunately, I don't know how to rework my code accordingly.

  • How do I create a confusion matrix based on my existing data?

How do I have to restructure my code to get a "nice" confusion matrix like the one in the example?

Dataframe:

import pandas as pd

d = {'purchaseid': [0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 9],
     'itemid': [ 3, 8, 2, 10, 3, 10, 4, 12, 3, 12, 3, 4, 8, 6, 3, 0, 5, 12, 9, 9, 13, 1, 7, 11, 11]}
df = pd.DataFrame(data=d)

   purchaseid  itemid
0           0       3
1           0       8
2           0       2
3           1      10
4           2       3
...         ...    ...

Code:

import random
import numpy as np
import scipy.sparse as sp

PERCENTAGE_SPLIT = 20
NUM_NEGATIVES = 4

def splitter(df):
    # hold out PERCENTAGE_SPLIT percent of the users for validation
    sum_purchase = df['purchaseid'].nunique()
    amount = round((sum_purchase / 100) * PERCENTAGE_SPLIT)

    random_list = random.sample(df['purchaseid'].unique().tolist(), amount)
    df_ = df.loc[df['purchaseid'].isin(random_list)]
    df_reduced = df.loc[~df['purchaseid'].isin(random_list)]
    return [df_reduced, df_]

def generate_matrix(df_main, dataframe, name):
    # sparse user x item interaction matrix; 1.0 marks a purchase
    mat = sp.dok_matrix((df_main.shape[0], len(df_main['itemid'].unique())), dtype=np.float32)
    for purchaseid, itemid in zip(dataframe['purchaseid'], dataframe['itemid']):
        mat[purchaseid, itemid] = 1.0
    return mat

dfs = splitter(df)
df_tr = dfs[0].copy(deep=True)
df_val = dfs[1].copy(deep=True)

train_mat = generate_matrix(df, df_tr, 'train')
val_mat = generate_matrix(df, df_val, 'val')
num_users, num_items = train_mat.shape


def get_train_samples(train_mat, num_negatives):
    user_input, item_input, labels = [], [], []
    num_user, num_item = train_mat.shape
    for (u, i) in train_mat.keys():
        user_input.append(u)
        item_input.append(i)
        labels.append(1)
        # negative instances
        for t in range(num_negatives):
            j = np.random.randint(num_item)
            while (u, j) in train_mat.keys():
                j = np.random.randint(num_item)
            user_input.append(u)
            item_input.append(j)
            labels.append(0)
    return user_input, item_input, labels

user_input, item_input, labels = get_train_samples(train_mat, NUM_NEGATIVES)
val_user_input, val_item_input, val_labels = get_train_samples(val_mat, NUM_NEGATIVES)

hist = model.fit([np.array(user_input), np.array(item_input)], np.array(labels),
                 validation_data=([np.array(val_user_input), np.array(val_item_input)], np.array(val_labels)))

from sklearn.metrics import classification_report
x_train = user_input, item_input
y_train = labels
x_test = val_user_input, val_item_input
y_test = val_labels


y_pred = model.predict([np.array(val_user_input), np.array(val_item_input)], batch_size=64, verbose=1)
y_pred_bool = np.argmax(y_pred, axis=1)

print(classification_report(y_test, y_pred_bool))

Example:

import numpy as np
import matplotlib.pyplot as plt

from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_confusion_matrix

# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Run classifier, using a model that is too regularized (C too low) to see
# the impact on the results
classifier = svm.SVC(kernel='linear', C=0.01).fit(X_train, y_train)

np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
title = "Confusion matrix, without normalization"


disp = plot_confusion_matrix(classifier, X_test, y_test, display_labels=class_names, cmap=plt.cm.Blues,)
disp.ax_.set_title(title)
print(title)
print(disp.confusion_matrix)

plt.show()


Try:

import seaborn as sns
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix, roc_curve, auc
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True)

[OUT] ValueError: Classification metrics can't handle a mix of binary and continuous targets
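The error can be reproduced in isolation with toy data (the arrays below are made up for illustration and are not from the question): sklearn's classification metrics refuse continuous predictions, but accept them once they are thresholded into class labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 1, 1, 0])          # binary ground truth
y_prob = np.array([0.2, 0.9, 0.6, 0.1])  # continuous sigmoid-style outputs

try:
    confusion_matrix(y_true, y_prob)      # continuous predictions raise
except ValueError as err:
    print(err)                            # "... mix of binary and continuous targets"

# thresholding first turns the probabilities into class labels
cm = confusion_matrix(y_true, (y_prob > 0.5).astype(int))
print(cm)
```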

EDIT:

I'm using the NCF model.


The architecture of a Neural Collaborative Filtering model, taken from the Neural Collaborative Filtering paper.

# full NCF model
# (imports assume tf.keras; standalone Keras works analogously)
from tensorflow.keras.layers import Input, Embedding, Flatten, Multiply, Concatenate, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

def get_model(num_users, num_items, latent_dim=8, dense_layers=[64, 32, 16, 8],
              reg_layers=[0, 0, 0, 0], reg_mf=0):

    # input layer
    input_user = Input(shape=(1,), dtype='int32', name='user_input')
    input_item = Input(shape=(1,), dtype='int32', name='item_input')
    
    # embedding layer
    mf_user_embedding = Embedding(input_dim=num_users, output_dim=latent_dim,
                        name='mf_user_embedding',
                        embeddings_initializer='RandomNormal',
                        embeddings_regularizer=l2(reg_mf), input_length=1)
    mf_item_embedding = Embedding(input_dim=num_items, output_dim=latent_dim,
                        name='mf_item_embedding',
                        embeddings_initializer='RandomNormal',
                        embeddings_regularizer=l2(reg_mf), input_length=1)
    mlp_user_embedding = Embedding(input_dim=num_users, output_dim=int(dense_layers[0]/2),
                         name='mlp_user_embedding',
                         embeddings_initializer='RandomNormal',
                         embeddings_regularizer=l2(reg_layers[0]), 
                         input_length=1)
    mlp_item_embedding = Embedding(input_dim=num_items, output_dim=int(dense_layers[0]/2),
                         name='mlp_item_embedding',
                         embeddings_initializer='RandomNormal',
                         embeddings_regularizer=l2(reg_layers[0]), 
                         input_length=1)

    # MF latent vector
    mf_user_latent = Flatten()(mf_user_embedding(input_user))
    mf_item_latent = Flatten()(mf_item_embedding(input_item))
    mf_cat_latent = Multiply()([mf_user_latent, mf_item_latent])

    # MLP latent vector
    mlp_user_latent = Flatten()(mlp_user_embedding(input_user))
    mlp_item_latent = Flatten()(mlp_item_embedding(input_item))
    mlp_cat_latent = Concatenate()([mlp_user_latent, mlp_item_latent])
    
    mlp_vector = mlp_cat_latent
    
    # build dense layer for model
    for i in range(1,len(dense_layers)):
        layer = Dense(dense_layers[i],
                      activity_regularizer=l2(reg_layers[i]),
                      activation='relu',
                      name='layer%d' % i)
        mlp_vector = layer(mlp_vector)

    predict_layer = Concatenate()([mf_cat_latent, mlp_vector])
    result = Dense(1, activation='sigmoid', 
                   kernel_initializer='lecun_uniform',name='result')

    model = Model(inputs=[input_user,input_item], outputs=result(predict_layer))

    return model

The output of your model, y_pred, does not contain class labels but (since the final layer is a single neuron with a sigmoid activation) values that can be interpreted as class probabilities. In particular, there is not just one confusion matrix to compute: you first have to select a discrimination threshold, i.e. a value between 0 and 1 that defines which of the continuous values in your prediction are assigned to which class.
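How to pick such a threshold is not specified here; one common option (an addition for illustration, not part of the original answer) is to take the point on the validation ROC curve that maximizes Youden's J statistic, tpr − fpr. The arrays below are made-up stand-ins for the validation labels and the model's predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve

# made-up validation labels and predicted probabilities
y_val = np.array([0, 0, 1, 1, 0, 1])
p_val = np.array([0.2, 0.1, 0.9, 0.8, 0.4, 0.7])

fpr, tpr, thresholds = roc_curve(y_val, p_val)
best = thresholds[np.argmax(tpr - fpr)]   # Youden's J = tpr - fpr
print(best)
```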

Assume you found that 0.8 is a good threshold, i.e. every entry in y_pred larger than 0.8 is assigned to class 1 and every smaller entry to class 0. Then the following computes the confusion matrix for that particular threshold:

import seaborn as sns
import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, (y_pred > 0.8).astype(int))
sns.heatmap(cm, annot=True)
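The same thresholding also fixes the classification_report call from the question: instead of taking an argmax over a single-column output (which always yields 0), threshold the sigmoid probabilities. The toy arrays below stand in for y_test and y_pred, with 0.8 as the hypothetical threshold:

```python
import numpy as np
from sklearn.metrics import classification_report

y_test = np.array([1, 0, 1, 1, 0])                  # toy ground-truth labels
y_pred = np.array([0.95, 0.30, 0.85, 0.40, 0.10])   # toy sigmoid outputs

y_pred_bool = (y_pred > 0.8).astype(int)            # threshold instead of argmax
print(classification_report(y_test, y_pred_bool))
```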
