Reinforcement learning cost function

Newbie question: I am writing an OpenAI Gym Pong player with TensorFlow. So far I have been able to create a randomly initialized network that randomly decides whether to move the player paddle up or down.

After an epoch is over (21 games played, which the computer won), I have collected a set of observations, moves and scores. The final observation of a game receives a score, and each preceding observation can be scored based on the Bellman equation.

Here is what I do not understand yet: how do I calculate the cost function so that it can serve as the starting gradient for backpropagation? I get how this works in supervised learning, but here we do not have any labels to score against.

How would I start optimizing the network?

Maybe a pointer to existing code or some literature would help.

Here's where I compute the rewards:

import numpy as np

def compute_observation_rewards(self, gamma, up_score_probabilities):
        """
        Applies the Bellman equation and determines the discounted reward for each stored observation
        :param gamma: Discount factor
        :param up_score_probabilities: Probabilities for up score (not used in this method)
        :returns: Array of normalized scores, one per move, in chronological order
        """

        score_sum = 0
        discounted_rewards = []
        # go backwards through all observations
        for p in reversed(self._states_score_action):
            s = p[1]  # score recorded for this observation

            # a non-zero score marks the end of a game, so restart the running sum
            if s != 0:
                score_sum = 0

            score_sum = score_sum * gamma + s
            discounted_rewards.append(score_sum)

        # the loop ran backwards, so restore chronological order
        discounted_rewards.reverse()

        # normalize scores to zero mean and unit variance
        discounted_rewards = np.array(discounted_rewards, dtype=np.float64)
        discounted_rewards -= np.mean(discounted_rewards)
        discounted_rewards /= np.std(discounted_rewards)

        return discounted_rewards
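
To make the discounting step easier to check by hand, here is a minimal standalone sketch of the same idea in plain Python. The names `discounted_returns` and `normalize` and the toy `scores` list are illustrative assumptions, not part of the original class; the sketch assumes, as the code above does, that a non-zero score marks the end of a game.

```python
def discounted_returns(scores, gamma):
    """Return the discounted return for each step, in chronological order.

    A non-zero score is assumed to mark the end of a game, so the running
    sum is reset there and rewards do not leak across games.
    """
    returns = [0.0] * len(scores)
    running = 0.0
    for t in reversed(range(len(scores))):
        if scores[t] != 0:
            running = 0.0
        running = running * gamma + scores[t]
        returns[t] = running
    return returns


def normalize(values, eps=1e-8):
    """Shift to zero mean and scale to (roughly) unit standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5
    # eps is a hypothetical guard against division by zero
    return [(v - mean) / (std + eps) for v in values]


# Two toy games: the first is lost (-1), the second is won (+1).
scores = [0, 0, -1, 0, 1]
returns = discounted_returns(scores, gamma=0.5)
print(returns)  # [-0.25, -0.5, -1.0, 0.5, 1.0]
```

Writing each return at index `t` keeps the result aligned with the stored observations, so no separate reversal step is needed.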

Below is my network:

with tf.variable_scope('NN_Model', reuse=tf.AUTO_REUSE):

        layer1 = tf.layers.conv2d(inputs,
                                3,
                                3,
                                strides=(1, 1),
                                padding='valid',
                                data_format='channels_last',
                                dilation_rate=(1, 1),
                                activation= tf.nn.relu, 
                                use_bias=True,
                                bias_initializer=tf.zeros_initializer(),
                                trainable=True,
                                name='layer1'
                            )
        # (N - F + 1) x (N - F + 1)
        # => layer1 should be 
        # (80 - 3 + 1) * (80 - 3 + 1) = 78 x 78

        pool1 = tf.layers.max_pooling2d(layer1,
                                        pool_size=5,
                                        strides=2,
                                        name='pool1')

        # int((N - f) / s +1) 
        # (78 - 5) / 2 + 1 = 73/2 + 1 = 37

        layer2 = tf.layers.conv2d(pool1,
                                5,
                                5,
                                strides=(2, 2),
                                padding='valid',
                                data_format='channels_last',
                                dilation_rate=(1, 1),
                                activation= tf.nn.relu, 
                                use_bias=True,
                                kernel_initializer=tf.random_normal_initializer(),
                                bias_initializer=tf.zeros_initializer(),
                                trainable=True,
                                name='layer2',
                                reuse=None
                            )

        # ((N + 2xpadding - F) / stride + 1) x ((N + 2xpadding - F) / stride + 1)
        # => layer2 should be 
        # int((37 + 0 - 5) / 2) + 1 
        # 16 + 1 = 17

        pool2 = tf.layers.max_pooling2d(layer2,
                                        pool_size=3,
                                        strides=2,
                                        name='pool2')

        # int((N - f) / s +1) 
        # (17 - 3) / 2 + 1 = 7 + 1 = 8

        flat1 = tf.layers.flatten(pool2, 'flat1')

        # K x 320 (8 x 8 spatial positions x 5 filters from layer2)

        full1 = tf.contrib.layers.fully_connected(flat1,
                                            num_outputs=1,
                                            activation_fn=tf.nn.sigmoid,
                                            weights_initializer=tf.contrib.layers.xavier_initializer(),
                                            biases_initializer=tf.zeros_initializer(),
                                            trainable=True,
                                            scope=None
                                        )
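
The shape comments in the network can be double-checked with a few lines of arithmetic. This is a small sketch under the assumptions already stated in those comments: an 80x80 input and 'valid' padding everywhere. The helper name `valid_output_size` is hypothetical.

```python
def valid_output_size(n, kernel, stride=1):
    """Output length along one axis for 'valid' padding (no padding added)."""
    return (n - kernel) // stride + 1


size = 80                              # assumed 80x80 input, as in the comments
size = valid_output_size(size, 3, 1)   # layer1: 3x3 kernel, stride 1 -> 78
size = valid_output_size(size, 5, 2)   # pool1: 5x5 window, stride 2 -> 37
size = valid_output_size(size, 5, 2)   # layer2: 5x5 kernel, stride 2 -> 17
size = valid_output_size(size, 3, 2)   # pool2: 3x3 window, stride 2 -> 8
flat_features = size * size * 5        # layer2 produces 5 feature maps
print(size, flat_features)  # 8 320
```

So each flattened example carries 8 * 8 * 5 = 320 features into the fully connected layer.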

The algorithm you're looking for is called REINFORCE. I would suggest reading chapter 13 of Sutton and Barto's RL book (Reinforcement Learning: An Introduction).

Here's pseudocode from the book. [image: pseudocode from the book]

Here, theta is the set of weights of your neural net. If you're unfamiliar with some of the rest of the notation, I'd suggest reading Chapter 3 of the above-mentioned book. It covers the basic problem formulation.
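
To connect this to the cost function asked about in the question: with REINFORCE, the usual surrogate loss is the negative sum of each action's log-probability weighted by its discounted return, so the returns play the role that labels play in supervised learning. The following is a minimal sketch in plain Python, not the book's pseudocode. It assumes a logistic policy over a small feature vector as a stand-in for the network's single sigmoid output; all function names and the toy episode are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def up_probability(weights, features):
    """Probability of moving up under a logistic policy (stand-in for the network)."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))


def reinforce_loss(weights, episode):
    """Surrogate loss: -sum_t G_t * log pi(a_t | s_t).

    `episode` is a list of (features, action, discounted_return) tuples,
    with action 1 for up and 0 for down.
    """
    loss = 0.0
    for features, action, ret in episode:
        p = up_probability(weights, features)
        log_prob = math.log(p) if action == 1 else math.log(1.0 - p)
        loss -= ret * log_prob
    return loss


def reinforce_step(weights, episode, learning_rate):
    """One gradient-descent step on the surrogate loss."""
    grad = [0.0] * len(weights)
    for features, action, ret in episode:
        p = up_probability(weights, features)
        # For a logistic policy, d/dw log pi(a|s) = (a - p) * x
        for i, x in enumerate(features):
            grad[i] -= ret * (action - p) * x
    return [w - learning_rate * g for w, g in zip(weights, grad)]


# Toy episode: "up" was followed by a positive return, "down" by a negative one.
episode = [([1.0, 0.5], 1, 1.0), ([1.0, -0.5], 0, -1.0)]
weights = [0.0, 0.0]
before = up_probability(weights, [1.0, 0.5])
weights = reinforce_step(weights, episode, learning_rate=0.1)
after = up_probability(weights, [1.0, 0.5])
print(before < after)  # True
```

In a TensorFlow model the gradient would not be written by hand: the same loss would be built from the network's output probability, the actions taken and the discounted returns, and then handed to an optimizer to minimize.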
