RL algorithm in Gluon

Hi all,
I am working on developing new RL algorithms.
I tried to look around for examples and solutions to similar problems, but I did not really find much.

My problem is a bit unusual: the learner has to pick an action from a set of available ones, and each action is characterized by a d-dimensional vector.
Basically, each action needs to be passed through the same NN, which generates a smaller representation; these representations are later used to sample an action via a rather complex variant of softmax.

Just to make it simpler: let’s assume the network is just a single dense block with two outputs, and that I have a gluon block which can receive a 2D vector and return some kind of score.
Currently I just call the network for every action and then have some logic handling the outputs, but this doesn’t look like the correct way of doing it (see the sketch below). I would like to create a gluon block which gets an NDArray (one row per action) and returns the index of the selected one.
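Roughly, what I do now is something like this (a hypothetical sketch; net stands for the shared scoring network):

# current approach: one forward pass per action, then custom selection logic
scores = [net(action.expand_dims(axis=0)) for action in actions]
# ... some hand-written logic picks an index from `scores` ...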

My questions are:

  1. I am not an NN expert: does this make sense to you? I know it makes sense from the ML point of view, but I am not sure the software structure makes any sense.
  2. What do you think is the best way to structure these blocks?
  3. I saw a few RL examples, but TBH some of them seem quite ‘basic’. Do you have any good references?
  4. I guess this can be seen as a multi-task regression problem where the tasks are all the same and the parameters are shared, but that seems like overkill, and I would still need to select the action from the scores at the end. Any thoughts on this?

You can surely have a block in Gluon that does what you suggest. A basic gluon block (using your terminology) would look like this:

from mxnet import gluon

class ActionPicker(gluon.Block):
    def __init__(self, **kwargs):
        super(ActionPicker, self).__init__(**kwargs)
        # placeholders: plug in your shared encoder and your sampler here
        self.encoder = sameNNLayer()
        self.sampler = veryComplexVersionOfSoftmax()

    def forward(self, actions):
        # actions: (n_action, action_d)
        representations = self.encoder(actions)
        # representations: (n_action, action_d2)
        samples = self.sampler(representations)
        # samples: (n_action, 1)
        index = samples.topk(axis=0)  # index of the top-scoring action
        return index
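To see the batched call in action, here is a toy instantiation with Dense layers standing in for the two placeholders (these are assumptions for illustration, not your real encoder or sampler):

import mxnet as mx
from mxnet import gluon

# toy stand-ins: a Dense encoder and a Dense scorer
net = gluon.nn.Sequential()
net.add(gluon.nn.Dense(8), gluon.nn.Dense(1))
net.initialize()

actions = mx.nd.random.uniform(shape=(5, 4))  # 5 candidate actions, d=4
scores = net(actions)                         # (5, 1) scores in one pass
index = scores.argmax(axis=0)                 # index of the selected action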

What is important is that the operations you apply to your actions are done with the NDArray API rather than NumPy. If you use .asnumpy(), this forces the data back to the CPU and can dramatically slow down your network.
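For example (purely illustrative, using samples from the block above):

# stays on the device where `samples` lives
index = samples.argmax(axis=0)

# forces a device-to-CPU copy and a synchronization point -- avoid in the hot path
import numpy as np
index = int(np.argmax(samples.asnumpy()))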

Given it is an RL task, how do you take into account the state?

Maybe worth checking these examples: https://github.com/chinakook/Awesome-MXNet#7-drl

This looks really trivial. Thanks.
The state is not going to be a problem; I just gave you a simplified version of the real setting.

Just a quick question: what if the sampler has state?
Let’s say that every time it gets invoked I want to increment a variable inside the sampler.
Do you suggest keeping a plain int and just manipulating it, or should I create a parameter for it?
I don’t see an easy way to control parameters with arbitrary logic (I saw that I can avoid attaching the gradient to them).
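To make it concrete, the two options I’m weighing look roughly like this (all names are hypothetical; the block needs .initialize() before the first call):

from mxnet import gluon

class CountingSampler(gluon.Block):
    def __init__(self, **kwargs):
        super(CountingSampler, self).__init__(**kwargs)
        # option 1: a plain Python int, invisible to gluon
        self.n_calls = 0
        # option 2: a non-trainable Parameter (grad_req='null'),
        # updated manually with set_data()
        self.step = self.params.get('step', shape=(1,), init='zeros',
                                    grad_req='null')

    def forward(self, representations):
        self.n_calls += 1
        self.step.set_data(self.step.data() + 1)
        # ... actual sampling logic would go here ...
        return representations.argmax(axis=0)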