
Multilabel Text Classification using TensorFlow

The text data is organized as a vector with 20,000 elements, like [2, 1, 0, 0, 5, ...., 0]. The i-th element indicates the frequency of the i-th word in a text.

The ground-truth label data is also represented as a vector with 4,000 elements, like [0, 0, 1, 0, 1, ...., 0]. The i-th element indicates whether the i-th label is a positive label for a text. The number of labels varies from text to text.

I have code for single-label text classification.

How can I edit the following code for multilabel text classification?

In particular, I would like to know the following points.

  • How to compute accuracy using TensorFlow.
  • How to set a threshold that judges whether a label is positive or negative. For instance, if the output is [0.80, 0.43, 0.21, 0.01, 0.32] and the ground truth is [1, 1, 0, 0, 1], the labels with scores over 0.25 should be judged as positive.

Thank you.

import tensorflow as tf

# hidden Layer
class HiddenLayer(object):
    def __init__(self, input, n_in, n_out):
        self.input = input

        w_h = tf.Variable(tf.random_normal([n_in, n_out],mean = 0.0,stddev = 0.05))
        b_h = tf.Variable(tf.zeros([n_out]))

        self.w = w_h
        self.b = b_h
        self.params = [self.w, self.b]

    def output(self):
        linarg = tf.matmul(self.input, self.w) + self.b
        # return directly; assigning to self.output would shadow this method
        return tf.nn.relu(linarg)

# output Layer
class OutputLayer(object):
    def __init__(self, input, n_in, n_out):
        self.input = input

        w_o = tf.Variable(tf.random_normal([n_in, n_out], mean = 0.0, stddev = 0.05))
        b_o = tf.Variable(tf.zeros([n_out]))

        self.w = w_o
        self.b = b_o
        self.params = [self.w, self.b]

    def output(self):
        linarg = tf.matmul(self.input, self.w) + self.b
        return tf.nn.relu(linarg)

# model
def model():
    h_layer = HiddenLayer(input = x, n_in = 20000, n_out = 1000)
    o_layer = OutputLayer(input = h_layer.output(), n_in = 1000, n_out = 4000)

    # loss function
    out = o_layer.output()
    cross_entropy = -tf.reduce_sum(y_*tf.log(out + 1e-9), name='xentropy')    

    # regularization
    l2 = (tf.nn.l2_loss(h_layer.w) + tf.nn.l2_loss(o_layer.w))
    lambda_2 = 0.01

    # compute loss
    loss = cross_entropy + lambda_2 * l2

    # compute accuracy for single label classification task
    correct_pred = tf.equal(tf.argmax(out, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, "float"))

    return loss, accuracy

You have to use a variation of the cross-entropy function in order to support multilabel classification. If you have fewer than about one thousand outputs you should use sigmoid_cross_entropy_with_logits; in your case, with 4,000 outputs, you may consider candidate sampling, as it is faster.
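To make the swap concrete, here is a NumPy sketch of the per-label loss that tf.nn.sigmoid_cross_entropy_with_logits computes (the numerically stable form given in the TensorFlow documentation); the example values are illustrative, not from the question:

```python
import numpy as np

def sigmoid_xent_with_logits(logits, labels):
    # Numerically stable per-label loss, matching the formula documented
    # for tf.nn.sigmoid_cross_entropy_with_logits:
    #   max(z, 0) - z*y + log(1 + exp(-|z|))
    z, y = logits, labels
    return np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))

logits = np.array([2.0, -1.0, 0.5])   # raw outputs, before the sigmoid
labels = np.array([1.0, 0.0, 1.0])    # multilabel ground truth
per_label_loss = sigmoid_xent_with_logits(logits, labels)
```

Note that the function takes logits, not sigmoid outputs: each of the 4,000 labels is treated as an independent binary classification, which is exactly what multilabel classification needs.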

How to compute accuracy using TensorFlow.

This depends on your problem and what you want to achieve. If you don't want to miss any object in an image, then if the classifier gets everything right but one label, you should consider the whole image an error. Alternatively, you can count each missed or misclassified label as an error. The latter, I think, is what sigmoid_cross_entropy_with_logits supports.
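The two notions of accuracy described above can be sketched in NumPy (the TF-1.x graph equivalents would use tf.equal together with tf.reduce_all or tf.reduce_mean; the data here is made up for illustration):

```python
import numpy as np

# Binarized predictions vs. ground truth for two texts, five labels each
pred  = np.array([[1, 1, 0, 0, 1],
                  [1, 0, 0, 0, 0]])
truth = np.array([[1, 1, 0, 0, 1],
                  [1, 1, 0, 0, 0]])

# Strict (exact-match) accuracy: a sample counts only if every label is right
exact_match = np.mean(np.all(pred == truth, axis=1))

# Per-label (Hamming) accuracy: each missed or misclassified label is one error
per_label = np.mean(pred == truth)
```

Here the second text has one wrong label out of five, so exact-match accuracy is 0.5 while per-label accuracy is 0.9; which one to report depends on how costly a single missed label is for your application.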

How to set a threshold which judges whether a label is positive or negative. For instance, if the output is [0.80, 0.43, 0.21, 0.01, 0.32] and the ground truth is [1, 1, 0, 0, 1], the labels with scores over 0.25 should be judged as positive.

A threshold is one way to go; you have to decide which value to use. But that is something of a hack, not real multilabel classification. For that you need the functions mentioned before.
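For completeness, thresholding the sigmoid outputs from the question's example is a one-liner; this NumPy sketch mirrors what tf.cast(tf.greater(out, 0.25), tf.float32) would do in the TF-1.x graph:

```python
import numpy as np

scores = np.array([0.80, 0.43, 0.21, 0.01, 0.32])  # sigmoid outputs
threshold = 0.25

# Labels whose score exceeds the threshold are judged positive
predicted = (scores > threshold).astype(np.float32)
```

With the 0.25 threshold from the question, `predicted` comes out as [1, 1, 0, 0, 1], matching the ground truth in that example; in practice the threshold is usually tuned on a validation set rather than fixed in advance.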

Change the relu of the output layer to a sigmoid. Modify the cross-entropy loss to the explicit mathematical formula of the sigmoid cross-entropy loss (the explicit loss was working in my case/version of TensorFlow).

import tensorflow as tf

# hidden Layer
class HiddenLayer(object):
    def __init__(self, input, n_in, n_out):
        self.input = input

        w_h = tf.Variable(tf.random_normal([n_in, n_out],mean = 0.0,stddev = 0.05))
        b_h = tf.Variable(tf.zeros([n_out]))

        self.w = w_h
        self.b = b_h
        self.params = [self.w, self.b]

    def output(self):
        linarg = tf.matmul(self.input, self.w) + self.b
        # return directly; assigning to self.output would shadow this method
        return tf.nn.relu(linarg)

# output Layer
class OutputLayer(object):
    def __init__(self, input, n_in, n_out):
        self.input = input

        w_o = tf.Variable(tf.random_normal([n_in, n_out], mean = 0.0, stddev = 0.05))
        b_o = tf.Variable(tf.zeros([n_out]))

        self.w = w_o
        self.b = b_o
        self.params = [self.w, self.b]

    def output(self):
        linarg = tf.matmul(self.input, self.w) + self.b
        # changed relu to sigmoid
        return tf.nn.sigmoid(linarg)

# model
# x and y_ are assumed to be defined elsewhere as placeholders, e.g.:
# x  = tf.placeholder(tf.float32, [None, 20000])
# y_ = tf.placeholder(tf.float32, [None, 4000])
def model():
    h_layer = HiddenLayer(input = x, n_in = 20000, n_out = 1000)
    o_layer = OutputLayer(input = h_layer.output(), n_in = 1000, n_out = 4000)

    # loss function
    out = o_layer.output()
    # modified cross entropy to the explicit formula of sigmoid cross-entropy loss
    cross_entropy = -tf.reduce_sum((y_ * tf.log(out + 1e-9)) + ((1 - y_) * tf.log(1 - out + 1e-9)), name='xentropy')

    # regularization
    l2 = (tf.nn.l2_loss(h_layer.w) + tf.nn.l2_loss(o_layer.w))
    lambda_2 = 0.01

    # compute loss
    loss = cross_entropy + lambda_2 * l2

    # compute accuracy for single label classification task
    correct_pred = tf.equal(tf.argmax(out, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, "float"))

    return loss, accuracy
