How to get probability vector for all actions in tf-agents?
I'm working on a Multi-Armed Bandit problem using LinearUCBAgent and LinearThompsonSamplingAgent, but they both return a single action for an observation. What I need is a probability for every action, which I can use for ranking.
You need to add the emit_policy_info argument when defining the agent. The specific values (encapsulated in a tuple) depend on the agent: predicted_rewards_sampled for LinearThompsonSamplingAgent and predicted_rewards_optimistic for LinearUCBAgent.
For example:
agent = LinearThompsonSamplingAgent(
time_step_spec=time_step_spec,
action_spec=action_spec,
    emit_policy_info=("predicted_rewards_sampled",)  # note the comma: a one-element tuple, not a string
)
Then, during inference, you'll need to access those fields and normalize them (via softmax):
action_step = agent.collect_policy.action(observation_step)
scores = tf.nn.softmax(action_step.info.predicted_rewards_sampled)
where tf comes from import tensorflow as tf, and observation_step is your observation array encapsulated in a TimeStep (from tf_agents.trajectories.time_step import TimeStep).
Note of caution: these are NOT probabilities; they are normalized scores, similar to the normalized outputs of a fully-connected layer.
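The normalize-and-rank step itself doesn't depend on tf-agents. A minimal NumPy sketch, using made-up predicted rewards in place of whatever action_step.info.predicted_rewards_sampled would actually contain:

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical per-arm predicted rewards for a 4-arm bandit
# (stands in for action_step.info.predicted_rewards_sampled).
predicted_rewards = np.array([0.2, 1.5, -0.3, 0.9])

scores = softmax(predicted_rewards)   # normalized scores, sum to 1
ranking = np.argsort(-scores)         # arm indices, best first
```

Here ranking orders the arms by score, which is all you need if the goal is ranking rather than true probabilities.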