Is there an example implementation of Deep Deterministic Policy Gradient (DDPG) for the Gluon API? If not, could someone help me implement it?
I tried to write it myself, but I got stuck at the point where I have to update my actor network.
I implemented the following training routine:
if do_training():
    # Sample random batch from replay buffer
    states, actions, rewards, next_states, terminals = replay_buffer.sample(batch_size=BATCH_SIZE)

    # Calculate target y with actor and critic target networks
    target_actions = actor_target(next_states)
    target_qvalues = critic_target(next_states, target_actions)
    y = rewards + (1.0 - terminals) * DISCOUNT_FACTOR * target_qvalues

    # Update critic network by minimizing reward prediction error
    with autograd.record():
        qvalues = critic(states, actions)
        loss = l2_loss(qvalues, y)
    loss.backward()
    trainer_critic.step(BATCH_SIZE)  # actual update with gluon.Trainer

    # Let actor propose particular action for given state
    actor_action = actor(states)
    actor_action.attach_grad()

    # Compute Q(state, action) and backpropagate w.r.t. actions
    with autograd.record():
        qvalues = critic(states, actor_action)
    qvalues.backward()
    action_gradients = actor_action.grad
My first problem is that all the gradients in action_gradients are identical across the whole batch, so I am not sure this is correct. My second problem is that I do not know how to proceed with the algorithm: how do I update the actor's weights using the action gradients computed from the critic network?