
Custom environment Gym for step function processing with DDPG Agent

I'm new to reinforcement learning, and I would like to process audio signals using this technique. I built a basic step function that I wish to flatten, in order to get my hands on Gym OpenAI and reinforcement learning in general.

To do so, I am using the GoalEnv provided by OpenAI, since I know what the target is: the flat signal. Here is the image with the input and the desired signal:

Image of the input and desired signals: https://imgur.com/pgdlTWK

The step function calls _set_action, which performs achieved_signal = convolution(input_signal, low_pass_filter) - offset; the low-pass filter takes a cutoff frequency as input as well. The cutoff frequency and the offset are the parameters that act on the observation to produce the output signal. The designed reward function returns the frame-to-frame L2 norm between the achieved signal and the desired signal, negated, to penalize a large norm.

Following is the environment I created:

import gym
import numpy as np
from gym import spaces
from scipy import signal


def butter_lowpass(cutoff, nyq_freq, order=4):
    normal_cutoff = float(cutoff) / nyq_freq
    b, a = signal.butter(order, normal_cutoff, btype='lowpass')
    return b, a

def butter_lowpass_filter(data, cutoff_freq, nyq_freq, order=4):
    b, a = butter_lowpass(cutoff_freq, nyq_freq, order=order)
    y = signal.filtfilt(b, a, data)
    return y

class StepSignal(gym.GoalEnv):

    def __init__(self, input_signal, sample_rate, desired_signal):
        super(StepSignal, self).__init__()

        self.initial_signal = input_signal
        self.signal = self.initial_signal.copy()
        self.sample_rate = sample_rate
        self.desired_signal = desired_signal
        self.distance_threshold = 10e-1

        max_offset = abs(max( max(self.desired_signal) , max(self.signal))
                 - min( min(self.desired_signal) , min(self.signal)) )

        self.action_space = spaces.Box(low=np.array([10e-4, -max_offset]),
                                       high=np.array([self.sample_rate/2 - 0.1, max_offset]),
                                       dtype=np.float16)

        obs = self._get_obs()
        self.observation_space = spaces.Dict(dict(
            desired_goal=spaces.Box(-np.inf, np.inf, shape=obs['achieved_goal'].shape, dtype='float32'),
            achieved_goal=spaces.Box(-np.inf, np.inf, shape=obs['achieved_goal'].shape, dtype='float32'),
            observation=spaces.Box(-np.inf, np.inf, shape=obs['observation'].shape, dtype='float32'),
        ))

    def step(self, action):
        range = self.action_space.high - self.action_space.low
        action = range / 2 * (action + 1)
        self._set_action(action)
        obs = self._get_obs()
        done = False

        info = {
                'is_success': self._is_success(obs['achieved_goal'], self.desired_signal),
               }
        reward = -self.compute_reward(obs['achieved_goal'],self.desired_signal)
        return obs, reward, done, info

    def reset(self):
        self.signal = self.initial_signal.copy()
        return self._get_obs()


    def _set_action(self, actions):
        actions = np.clip(actions,a_max=self.action_space.high,a_min=self.action_space.low)
        cutoff = actions[0]
        offset = actions[1]
        print(cutoff, offset)
        self.signal = butter_lowpass_filter(self.signal, cutoff, self.sample_rate/2) - offset

    def _get_obs(self):
        obs = self.signal
        achieved_goal = self.signal
        return {
        'observation': obs.copy(),
        'achieved_goal': achieved_goal.copy(),
        'desired_goal': self.desired_signal.copy(),
        }

    def compute_reward(self, goal_achieved, goal_desired):
        d = np.linalg.norm(goal_desired-goal_achieved)
        return d


    def _is_success(self, achieved_goal, desired_goal):
        d = self.compute_reward(achieved_goal, desired_goal)
        return (d < self.distance_threshold).astype(np.float32)
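
A quick way to sanity-check the environment above is to instantiate it directly and take one random step; the following is a minimal sketch (the signals mirror the ones defined further down in the question):

# Sanity-check sketch: exercise reset()/step() of StepSignal directly.
sample_rate = 30
x = np.linspace(0, 20, 20 * sample_rate)
in_signal = 0.5 * (np.sign(x - 5) + 9)        # step signal
desired = 3 * np.ones_like(in_signal)

env = StepSignal(in_signal, sample_rate, desired)
obs = env.reset()
print(obs['observation'].shape, obs['desired_goal'].shape)

action = env.action_space.sample()            # a random (cutoff, offset) pair
obs, reward, done, info = env.step(action)
print(reward, info['is_success'])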

The environment can then be instantiated into a variable, and flattened through the FlattenDictWrapper, as advised at the end of https://openai.com/blog/ingredients-for-robotics-research/.

length = 20
sample_rate = 30 # 30 Hz
in_signal_length = 20*sample_rate # 20sec signal
x = np.linspace(0, length, in_signal_length)

# Desired output
y = 3*np.ones(in_signal_length)
# Step signal
in_signal = 0.5*(np.sign(x-5)+9)

env = gym.make('stepsignal-v0', input_signal=in_signal, sample_rate=sample_rate, desired_signal=y)
env = gym.wrappers.FlattenDictWrapper(env, dict_keys=['observation','desired_goal'])
env.reset()
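
Note that gym.make('stepsignal-v0', ...) only works if the environment has been registered with Gym beforehand; that step is not shown above. A minimal registration sketch could look like this (the module path stepsignal_env is a placeholder for wherever StepSignal actually lives):

# Hypothetical registration snippet: 'stepsignal_env:StepSignal' assumes the
# class is defined in a module named stepsignal_env.py on the Python path.
from gym.envs.registration import register

register(
    id='stepsignal-v0',
    entry_point='stepsignal_env:StepSignal',
    max_episode_steps=5,   # optional; matches nb_max_episode_steps used below
)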

The agent is a DDPG agent from keras-rl, since the actions can take any values in the continuous action_space described in the environment. I wonder why the actor and critic nets need an input with an additional dimension, in input_shape=(1,) + env.observation_space.shape.

from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Flatten, Input, Concatenate
from keras.optimizers import Adam
from rl.agents import DDPGAgent
from rl.memory import SequentialMemory
from rl.policy import BoltzmannQPolicy
from rl.random import OrnsteinUhlenbeckProcess

nb_actions = env.action_space.shape[0]

# Building Actor agent (Policy-net)
actor = Sequential()
actor.add(Flatten(input_shape=(1,) + env.observation_space.shape, name='flatten'))
actor.add(Dense(128))
actor.add(Activation('relu'))
actor.add(Dense(64))
actor.add(Activation('relu'))
actor.add(Dense(nb_actions))
actor.add(Activation('linear'))
actor.summary()

# Building Critic net (Q-net)
action_input = Input(shape=(nb_actions,), name='action_input')
observation_input = Input(shape=(1,) + env.observation_space.shape, name='observation_input')
flattened_observation = Flatten()(observation_input)
x = Concatenate()([action_input, flattened_observation])
x = Dense(128)(x)
x = Activation('relu')(x)
x = Dense(64)(x)
x = Activation('relu')(x)
x = Dense(1)(x)
x = Activation('linear')(x)
critic = Model(inputs=[action_input, observation_input], outputs=x)
critic.summary()

# Building Keras agent
memory = SequentialMemory(limit=2000, window_length=1)
policy = BoltzmannQPolicy()
random_process = OrnsteinUhlenbeckProcess(size=nb_actions, theta=0.6, mu=0, sigma=0.3)
agent = DDPGAgent(nb_actions=nb_actions, actor=actor, critic=critic, critic_action_input=action_input,
                  memory=memory, nb_steps_warmup_critic=2000, nb_steps_warmup_actor=10000,
                  random_process=random_process, gamma=.99, target_model_update=1e-3)
agent.compile(Adam(lr=1e-3, clipnorm=1.), metrics=['mae'])

Finally, the agent is trained:

import pickle

filename = 'mem20k_heaviside_flattening'
hist = agent.fit(env, nb_steps=10, visualize=False, verbose=2, nb_max_episode_steps=5)
with open('./history_dqn_test_'+ filename + '.pickle', 'wb') as handle:
        pickle.dump(hist.history, handle, protocol=pickle.HIGHEST_PROTOCOL)
        agent.save_weights('h5f_files/dqn_{}_weights.h5f'.format(filename), overwrite=True)
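
After training, the saved weights can be reloaded and the greedy policy evaluated on the same environment; a short sketch using keras-rl's built-in test loop:

# Reload the saved weights and run a few evaluation episodes
# (greedy actor, no exploration noise).
agent.load_weights('h5f_files/dqn_{}_weights.h5f'.format(filename))
agent.test(env, nb_episodes=5, visualize=False, nb_max_episode_steps=5)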

Now here is the catch: the agent seems to always be stuck in the same neighborhood of output values across all episodes, for the same instance of my env:

Image: https://imgur.com/kaKhZNF

The cumulated reward is negative since I only allowed the agent to get negative rewards. I used https://github.com/openai/gym/blob/master/gym/envs/robotics/fetch_env.py, which is part of the OpenAI code, as an example. Across one episode, I should get varying sets of actions converging towards a (cutoff_final, offset_final) that would bring my input step signal close to my flat desired signal, which is clearly not the case. In addition, I thought that, across successive episodes, I should get different actions.

I wonder why the actor and critic nets need an input with an additional dimension, in input_shape=(1,) + env.observation_space.shape

I think the GoalEnv is designed with HER (Hindsight Experience Replay) in mind, since it will use the "sub-spaces" inside the observation_space to learn from sparse reward signals (there is a paper on the OpenAI website that explains how HER works). I haven't looked at the implementation, but my guess is that there needs to be an additional input since HER also processes the "goal" parameter.
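
For reference, the sparse branch of compute_reward in the fetch_env.py linked in the question boils down to roughly this (a condensed sketch, not the full implementation):

# With HER, the reward is recomputed for relabelled goals, so it must depend
# only on achieved_goal and desired_goal (plus the threshold).
def compute_reward(self, achieved_goal, desired_goal, info):
    d = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
    return -(d > self.distance_threshold).astype(np.float32)  # 0 on success, -1 otherwise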

Since it seems you are not using HER (which works with any off-policy algorithm, including DQN, DDPG, etc.), you should handcraft an informative reward function (rewards are not binary, e.g., 1 if the objective is achieved, 0 otherwise) and use the base Env class. The reward should be calculated inside the step method, since rewards in MDPs are functions like r(s, a, s'), so there you will probably have all the information you need. Hope it helps.
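
A minimal sketch of that suggestion, reusing the butter_lowpass_filter helper from the question (the class name StepSignalDense and the always-False done flag are illustrative choices, not a definitive implementation):

import gym
import numpy as np
from gym import spaces

class StepSignalDense(gym.Env):
    """Plain gym.Env variant: dense reward computed inside step()."""

    def __init__(self, input_signal, sample_rate, desired_signal):
        super(StepSignalDense, self).__init__()
        self.initial_signal = input_signal
        self.signal = input_signal.copy()
        self.sample_rate = sample_rate
        self.desired_signal = desired_signal

        max_offset = abs(max(desired_signal.max(), input_signal.max())
                         - min(desired_signal.min(), input_signal.min()))
        self.action_space = spaces.Box(low=np.array([1e-3, -max_offset]),
                                       high=np.array([sample_rate / 2 - 0.1, max_offset]),
                                       dtype=np.float32)
        self.observation_space = spaces.Box(-np.inf, np.inf,
                                            shape=input_signal.shape, dtype=np.float32)

    def step(self, action):
        cutoff, offset = np.clip(action, self.action_space.low, self.action_space.high)
        self.signal = butter_lowpass_filter(self.signal, cutoff, self.sample_rate / 2) - offset
        # Dense, informative reward computed right here, from the transition itself:
        reward = -np.linalg.norm(self.desired_signal - self.signal)
        done = False          # let the wrapper/agent cap the episode length
        return self.signal.copy(), reward, done, {}

    def reset(self):
        self.signal = self.initial_signal.copy()
        return self.signal.copy()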
