Difficult reinforcement learning query

I'm struggling to figure out how to approach this, so I hope someone here can offer some guidance.

Scenario: I have a 10-character string, let's call it the DNA, made up of the following characters:

F
-
+
[
]
X

for example DNA = ['F', 'F', '+', '+', '-', '[', 'X', '-', ']', '-']

Now these DNA strings get converted to physical representations, from which I can compute a fitness or reward value. So an RL flowchart for this scenario would look like this:

PS The maximum fitness is not known or specified.

Step 1: Get random DNA string

Step 2: Compute fitness

Step 3: Get another random DNA string

Step 4: Compute fitness

Step 5: Compute gradient and see which way is up

Step 6: Train ML algorithm to generate better and better DNA strings until fitness no longer increases

For clarity's sake, the best DNA string, i.e. the one that returns the highest fitness for my current purposes, is:
['F', 'X', 'X', 'X', 'X', 'F', 'X', 'X', 'X', 'X']
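
For illustration, here is a minimal toy stand-in for the fitness evaluation, assuming fitness is simply the number of positions matching that best string (the real value would of course come from the physical representation); the names ALPHABET, TARGET, fitness and random_dna are made up for this sketch:

    import random

    ALPHABET = ['F', '-', '+', '[', ']', 'X']
    TARGET = ['F', 'X', 'X', 'X', 'X', 'F', 'X', 'X', 'X', 'X']

    def fitness(dna):
        # Toy proxy for the physical evaluation: count matching positions.
        return sum(a == b for a, b in zip(dna, TARGET))

    def random_dna(length=10):
        # Steps 1/3 of the flowchart: draw a uniformly random DNA string.
        return [random.choice(ALPHABET) for _ in range(length)]

    dna = random_dna()
    print(dna, fitness(dna))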

How can I train an ML algorithm to learn and output this DNA string?

I'm trying to wrap my brain around policy gradient methods, but what will my input to the ML algorithm be? There are no states like in the OpenAI Gym examples.

EDIT: Final goal: an algorithm that learns to generate higher-fitness DNA strings. This has to happen without any human supervision, i.e. NOT supervised learning but reinforcement learning.

Akin to a genetic algorithm (GA) that evolves better and better DNA strings.
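
Since there is no environment state here, one way to read "policy gradient without states" is as a stateless bandit: the policy is just a table of learnable logits, one 6-way categorical distribution per DNA position, trained with REINFORCE. A minimal numpy sketch of that idea, reusing ALPHABET and fitness from the toy snippet above (the learning rate, baseline step and iteration count are arbitrary choices):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # With no input state, the policy reduces to a plain table of logits:
    # one categorical distribution over the 6 characters per position.
    logits = np.zeros((10, len(ALPHABET)))
    lr, baseline = 0.1, 0.0

    for _ in range(2000):
        probs = np.array([softmax(row) for row in logits])
        idx = [np.random.choice(len(ALPHABET), p=p) for p in probs]
        r = fitness([ALPHABET[i] for i in idx])   # episode reward
        baseline += 0.05 * (r - baseline)          # running-mean baseline
        for pos, a in enumerate(idx):
            grad = -probs[pos]                     # d log pi / d logits
            grad[a] += 1.0                         # = one_hot(a) - probs
            logits[pos] += lr * (r - baseline) * grad

    print([ALPHABET[i] for i in logits.argmax(axis=1)])

With the toy fitness above this converges to the target string, and the same loop works for any black-box fitness, which is the GA-like behaviour described in the EDIT.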

Assuming the problem is to mutate a given string into another string with a higher fitness value, the Markov Decision Process (MDP) can be modeled as follows (a code sketch of this environment follows the list):

  • Initial State: A random DNA string.
  • Action: Mutate the string into a similar one that (ideally) has a higher fitness value.
  • State: The current string generated by the agent.
  • Done Signal: When more than 5 characters (this threshold can be set to any value) differ from the random string drawn at the start of the episode.
  • Reward: fitness(next_state) - fitness(state) + similarity(state, next_state), or simply fitness(next_state) - fitness(state).
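
A minimal Gym-style sketch of that MDP, under some assumptions the list above does not fix: one mutation per step, a flat action id encoding (position, character), and the simpler reward variant without the similarity term. The class and parameter names are made up for illustration:

    import random

    ALPHABET = ['F', '-', '+', '[', ']', 'X']

    class DnaMutationEnv:
        def __init__(self, fitness_fn, length=10, max_changes=5):
            self.fitness_fn = fitness_fn    # black-box fitness evaluation
            self.length = length
            self.max_changes = max_changes  # done-signal threshold

        def reset(self):
            # Initial state: a random DNA string.
            self.start = [random.choice(ALPHABET) for _ in range(self.length)]
            self.state = list(self.start)
            return tuple(self.state)

        def step(self, action):
            # Action id in [0, 60): which position gets which character.
            pos, char = divmod(action, len(ALPHABET))
            old_fitness = self.fitness_fn(self.state)
            self.state[pos] = ALPHABET[char]
            # Reward: fitness(next_state) - fitness(state).
            reward = self.fitness_fn(self.state) - old_fitness
            # Done: more than max_changes characters differ from the start string.
            changed = sum(a != b for a, b in zip(self.state, self.start))
            return tuple(self.state), reward, changed > self.max_changes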

You could start with Q-learning with a discrete action space of dimension 10 (one slot per string position), each with 6 choices: (F, -, +, [, ], X).
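
A tabular Q-learning loop over the environment sketched above (hyperparameters are arbitrary; the sketch flattens the action space into 60 single-mutation actions to keep the table simple). Note the state space has 6^10 ≈ 60 million strings, so a lookup table is only a starting point, and function approximation would be the natural next step:

    from collections import defaultdict

    def q_learning(env, episodes=5000, alpha=0.1, gamma=0.9, eps=0.2):
        n_actions = env.length * len(ALPHABET)  # 10 positions x 6 characters
        Q = defaultdict(lambda: [0.0] * n_actions)
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # Epsilon-greedy action selection.
                a = (random.randrange(n_actions) if random.random() < eps
                     else max(range(n_actions), key=lambda i: Q[s][i]))
                s2, r, done = env.step(a)
                # Standard Q-learning update.
                Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
                s = s2
        return Q

    Q = q_learning(DnaMutationEnv(fitness))  # fitness: e.g. the toy function above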
