Difficult reinforcement learning query

I'm struggling to figure out how to approach this, so I hope someone here can offer some guidance.

Scenario: I have a 10-character string, let's call it the DNA, made up of the following characters:

F
-
+
[
]
X

for example DNA = ['F', 'F', '+', '+', '-', '[', 'X', '-', ']', '-']

Now these DNA strings get converted to physical representations, from which I can compute a fitness or reward value. So an RL flowchart for this scenario would look like this:

PS The maximum fitness is not known or specified.

Step 1: Get random DNA string

Step 2: Compute fitness

Step 3: Get another random DNA string

Step 4: Compute fitness

Step 5: Compute gradient and see which way is up

Step 6: Train ML algorithm to generate better and better DNA strings until fitness no longer increases

For clarity's sake, the best DNA string, i.e. the one that returns the highest fitness for my current purposes, is:
['F', 'X', 'X', 'X', 'X', 'F', 'X', 'X', 'X', 'X']
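
For illustration, here is a minimal toy stand-in for the fitness evaluation, assuming fitness is simply the number of positions matching that best string (the real value would of course come from the physical representation); the names ALPHABET, TARGET, fitness and random_dna are made up for this sketch:

    import random

    ALPHABET = ['F', '-', '+', '[', ']', 'X']
    TARGET = ['F', 'X', 'X', 'X', 'X', 'F', 'X', 'X', 'X', 'X']

    def fitness(dna):
        # Toy proxy for the physical evaluation: count matching positions.
        return sum(a == b for a, b in zip(dna, TARGET))

    def random_dna(length=10):
        # Steps 1/3 of the flowchart: draw a uniformly random DNA string.
        return [random.choice(ALPHABET) for _ in range(length)]

    dna = random_dna()
    print(dna, fitness(dna))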

How can I train an ML algorithm to learn and output this DNA string?

I'm trying to wrap my brain around policy gradient methods, but what will my input to the ML algorithm be? There are no states like in the OpenAI Gym examples.

EDIT: Final goal: an algorithm that learns to generate higher-fitness DNA strings. This has to happen without any human supervision, i.e. NOT supervised learning but reinforcement learning.

Akin to a genetic algorithm (GA) that evolves better and better DNA strings.
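
Since there is no environment state here, one way to read "policy gradient without states" is as a stateless bandit: the policy is just a table of learnable logits, one 6-way categorical distribution per DNA position, trained with REINFORCE. A minimal numpy sketch of that idea, reusing ALPHABET and fitness from the toy snippet above (the learning rate, baseline step and iteration count are arbitrary choices):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # With no input state, the policy reduces to a plain table of logits:
    # one categorical distribution over the 6 characters per position.
    logits = np.zeros((10, len(ALPHABET)))
    lr, baseline = 0.1, 0.0

    for _ in range(2000):
        probs = np.array([softmax(row) for row in logits])
        idx = [np.random.choice(len(ALPHABET), p=p) for p in probs]
        r = fitness([ALPHABET[i] for i in idx])   # episode reward
        baseline += 0.05 * (r - baseline)          # running-mean baseline
        for pos, a in enumerate(idx):
            grad = -probs[pos]                     # d log pi / d logits
            grad[a] += 1.0                         # = one_hot(a) - probs
            logits[pos] += lr * (r - baseline) * grad

    print([ALPHABET[i] for i in logits.argmax(axis=1)])

With the toy fitness above this converges to the target string, and the same loop works for any black-box fitness, which is the GA-like behaviour described in the EDIT.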

Assuming the problem is to mutate a given string into another string with a higher fitness value, the Markov Decision Process (MDP) can be modeled as follows (a code sketch of this environment follows the list):

  • Initial State: A random DNA string.
  • Action: Mutate the string into a similar one that (ideally) has a higher fitness value.
  • State: The current string generated by the agent.
  • Done Signal: When more than 5 characters (this threshold can be set to any value) differ from the random string drawn at the start of the episode.
  • Reward: fitness(next_state) - fitness(state) + similarity(state, next_state), or simply fitness(next_state) - fitness(state).
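
A minimal Gym-style sketch of that MDP, under some assumptions the list above does not fix: one mutation per step, a flat action id encoding (position, character), and the simpler reward variant without the similarity term. The class and parameter names are made up for illustration:

    import random

    ALPHABET = ['F', '-', '+', '[', ']', 'X']

    class DnaMutationEnv:
        def __init__(self, fitness_fn, length=10, max_changes=5):
            self.fitness_fn = fitness_fn    # black-box fitness evaluation
            self.length = length
            self.max_changes = max_changes  # done-signal threshold

        def reset(self):
            # Initial state: a random DNA string.
            self.start = [random.choice(ALPHABET) for _ in range(self.length)]
            self.state = list(self.start)
            return tuple(self.state)

        def step(self, action):
            # Action id in [0, 60): which position gets which character.
            pos, char = divmod(action, len(ALPHABET))
            old_fitness = self.fitness_fn(self.state)
            self.state[pos] = ALPHABET[char]
            # Reward: fitness(next_state) - fitness(state).
            reward = self.fitness_fn(self.state) - old_fitness
            # Done: more than max_changes characters differ from the start string.
            changed = sum(a != b for a, b in zip(self.state, self.start))
            return tuple(self.state), reward, changed > self.max_changes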

You could start with Q-learning with a discrete action space of dimension 10 (one slot per string position), each with 6 choices: (F, -, +, [, ], X).
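
A tabular Q-learning loop over the environment sketched above (hyperparameters are arbitrary; the sketch flattens the action space into 60 single-mutation actions to keep the table simple). Note the state space has 6^10 ≈ 60 million strings, so a lookup table is only a starting point, and function approximation would be the natural next step:

    from collections import defaultdict

    def q_learning(env, episodes=5000, alpha=0.1, gamma=0.9, eps=0.2):
        n_actions = env.length * len(ALPHABET)  # 10 positions x 6 characters
        Q = defaultdict(lambda: [0.0] * n_actions)
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # Epsilon-greedy action selection.
                a = (random.randrange(n_actions) if random.random() < eps
                     else max(range(n_actions), key=lambda i: Q[s][i]))
                s2, r, done = env.step(a)
                # Standard Q-learning update.
                Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
                s = s2
        return Q

    Q = q_learning(DnaMutationEnv(fitness))  # fitness: e.g. the toy function above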
