How to Record Variables in PyTorch Without Breaking Gradient Computation?
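I am trying to implement some policy gradient training similar to this. However, I would like to manipulate the rewards (a discounted future sum and other differentiable operations) before doing the backward pass.
Consider the manipulate function, defined to compute the reward to go: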
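import numpy as np
import torch as T

def manipulate(reward_pool):
    n = len(reward_pool)
    R = np.zeros_like(reward_pool)
    for i in reversed(range(n)):
        # reward to go: current reward plus the sum of all later rewards
        R[i] = reward_pool[i] + (R[i+1] if i+1 < n else 0)
    return T.as_tensor(R)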
I tried to store the rewards in a list:
# pseudocode
reward_pool = [0 for i in range(batch_size)]

for k in range(batch_size):
    act = net(state)
    state, reward = env.step(act)
    reward_pool[k] = reward

R = manipulate(reward_pool)
R.backward()
optimizer.step()
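It seems that the in-place operation breaks the gradient computation; the code gives me an error: one of the variables needed for gradient computation has been modified by an inplace operation.
I also tried initializing an empty tensor first and storing the rewards in that tensor, but the in-place operation is still the issue: a view of a leaf Variable that requires grad is being used in an in-place operation.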
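For reference, the second error can be reproduced in isolation. The following is only a minimal sketch: it assumes the pre-allocated buffer was created with requires_grad=True (the tensor-based attempt itself is not shown above), and `buf` and `w` are hypothetical names standing in for the buffer and a network parameter.

```python
import torch

# Assumption: the reward buffer is a leaf tensor created with requires_grad=True.
buf = torch.zeros(3, requires_grad=True)
# Hypothetical parameter standing in for the network weights.
w = torch.tensor(2.0, requires_grad=True)

try:
    # Indexed assignment is an in-place write into a view of the leaf tensor.
    buf[0] = w * 1.0
    raised = False
except RuntimeError as err:
    raised = True
    print(err)
```
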
I am kind of new to PyTorch. Does anyone know the right way to record rewards in this case?
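The issue is due to assignment into an existing object. Simply initialize an empty pool (list) for each iteration and append to it whenever a new reward is computed, i.e.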
reward_pool = []

for k in range(batch_size):
    act = net(state)
    state, reward = env.step(act)
    reward_pool.append(reward)

R = manipulate(reward_pool)
R.backward()
optimizer.step()
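The append-based approach can be combined with a reward-to-go computation that stays entirely in PyTorch, so the graph is preserved. This is a sketch under assumptions, not the original poster's code: `reward_to_go` is a hypothetical replacement for manipulate, the parameter `w` is a stand-in for net and env, and the result is reduced with `.sum()` because `backward()` without arguments expects a scalar.

```python
import torch

def reward_to_go(reward_pool):
    """Undiscounted reward to go: R[i] = sum of reward_pool[i:]."""
    # torch.stack builds a new tensor that stays connected to every entry's graph.
    rewards = torch.stack(reward_pool)
    # Reverse, take the running sum, reverse back; no in-place writes involved.
    reversed_sums = torch.cumsum(torch.flip(rewards, dims=[0]), dim=0)
    return torch.flip(reversed_sums, dims=[0])

# Hypothetical stand-in for net/env: each reward depends on the parameter w.
w = torch.tensor(2.0, requires_grad=True)

reward_pool = []
for k in range(3):
    reward_pool.append(w * (k + 1))   # rewards are 2, 4, 6

R = reward_to_go(reward_pool)         # tensor([12., 10., 6.])
R.sum().backward()                    # reduce to a scalar before backward()
print(w.grad)                         # d/dw of 14*w
```

Using torch.stack and torch.cumsum instead of NumPy avoids converting the rewards to plain arrays, which would drop their gradient history.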