TensorFlow: training a neural network with string labels

Question

For a university project I have to implement a neural network for an OCR task using TensorFlow. The training dataset consists of two files, train-data.csv and train-target.csv. In train-data.csv every row holds the bits of a 16x8 bitmap; in train-target.csv every row is a single character in [a-z], which is the label for the corresponding row in train-data.csv.

I'm having some issues with the label dataset. I followed the MNIST tutorial, but the difference here is that I have string labels instead of one-hot encoded vectors. Following the tutorial, I'm using the softmax function and cross-entropy.

# First, y * tf.math.log(y_hat) computes the element-wise product of the two tensors
# Second, tf.reduce_sum(..., reduction_indices=[1]) sums along the second dimension (the first dimension indexes the examples)
# Finally, tf.reduce_mean() computes the mean over the first dimension, i.e. over the examples
cross_entropy = tf.reduce_mean(-tf.reduce_sum(tf.strings.to_number(y) * tf.math.log(y_hat), reduction_indices=[1]))

train_step = tf.compat.v1.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

In the lines above I used tf.strings.to_number(y) to convert the char to a numeric value.

This conversion causes problems when I run the session, because the run() method does not accept tensor objects as feed_dict values.

for _ in range(1000):
    batch_xs, batch_ys = next_batch(100, raw_train_data, train_targets)
    sess.run(train_step, feed_dict={x: batch_xs, y: tf.strings.to_number(batch_ys.reshape((100,1)))})

If I don't convert the char to a numeric value, I get this error:

InvalidArgumentError: StringToNumberOp could not correctly convert string: e
 [[{{node StringToNumber}}]]

I'm trying to figure out how to solve this issue, or more generally how to train a neural network using character labels. I've been working on this problem all day. Does anyone know how to solve this?

Answer 1

I finally found the error. Because I'm quite new to machine learning, I had forgotten that many algorithms do not handle categorical data directly.

The solution was to one-hot encode the target labels and feed the resulting array to the network, using this function:

# define universe of possible input values
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# define a mapping of chars to integers
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))


def one_hot_encode(data_array):
    integer_encoded = [char_to_int[char] for char in data_array]

    # one hot encode
    onehot_encoded = list()
    for value in integer_encoded:
        letter = [0 for _ in range(len(alphabet))]
        letter[value] = 1
        onehot_encoded.append(letter)

    return onehot_encoded
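
As a quick sanity check of the encoding above, here is a minimal pure-Python sketch. The sample labels, the one_hot_decode helper, the mean_cross_entropy helper, and the uniform "prediction" are all made up for illustration and are not part of my project; the loss is just a plain-Python version of the cross-entropy formula from the question, not TensorFlow code.

```python
import math

alphabet = 'abcdefghijklmnopqrstuvwxyz'
char_to_int = {c: i for i, c in enumerate(alphabet)}
int_to_char = {i: c for i, c in enumerate(alphabet)}


def one_hot_encode(labels):
    """Return one 26-element 0/1 row per label character."""
    rows = []
    for char in labels:
        row = [0] * len(alphabet)
        row[char_to_int[char]] = 1
        rows.append(row)
    return rows


def one_hot_decode(rows):
    """Hypothetical helper: map each row back to its character
    by taking the index of the largest entry."""
    return [int_to_char[row.index(max(row))] for row in rows]


def mean_cross_entropy(targets, predictions):
    """Hypothetical helper: mean over examples of -sum(y * log(y_hat))."""
    total = 0.0
    for y, y_hat in zip(targets, predictions):
        total += -sum(t * math.log(p) for t, p in zip(y, y_hat))
    return total / len(targets)


batch_ys = ['e', 'a', 'z']          # made-up labels for illustration
encoded = one_hot_encode(batch_ys)  # 3 rows of 26 numbers each
decoded = one_hot_decode(encoded)   # round trip back to characters

# A made-up prediction that gives every letter the same probability.
uniform = [[1.0 / len(alphabet)] * len(alphabet) for _ in encoded]
loss = mean_cross_entropy(encoded, uniform)  # equals log(26) for a uniform guess
```

In the training loop from the question, the encoded batch would presumably be passed as the value for y in feed_dict (assuming y is a float placeholder with 26 columns), and the tf.strings.to_number calls would be dropped from both the loss and the feed.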

Answer 1 (accepted), 2019-09-05 20:41:07
