Computing the normalization parameter on the GPU in the SoftmaxOutput implementation

In the current implementation of the SoftmaxOutput operator, when the normalization is “valid”, valid_cnt is computed as follows:

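if (param_.normalization == softmaxout_enum::kValid) {
  int i_label = static_cast<int>(param_.ignore_label);
  // Request a host-side (CPU) buffer with the same shape as the label.
  Tensor<cpu, 1, DType> workspace =
    ctx.requested[softmaxout_enum::kTempSpace].get_host_space_typed<1, DType>(
    label.shape_);
  // Copy the label from the device into the host buffer.
  Copy(workspace, label, label.stream_);
  // Exclude every label that equals the ignore label from the count.
  for (index_t i = 0; i < label.size(0); ++i) {
    if (static_cast<int>(workspace[i]) == i_label) {
      valid_cnt--;
    }
  }
  // Avoid dividing by zero when every label is ignored.
  valid_cnt = valid_cnt == 0 ? 1 : valid_cnt;
}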

I have some questions about this:

  1. Why does the label need to be copied into the host workspace?
  2. Is it possible to compute valid_cnt on the GPU?
  3. Is it possible to compute valid_cnt while computing the gradients?
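
To make question 2 concrete: valid_cnt is a sum over a per-element test (does the label differ from ignore_label?). Integer addition is associative, so the same sum could in principle be evaluated as a parallel reduction on the device instead of a sequential host loop. The sketch below is plain host C++ that only illustrates this formulation. CountValidLabels is a hypothetical helper, not an MXNet or mshadow function, and std::vector stands in for the label tensor.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of MXNet): counts the labels that differ
// from ignore_label. Each element contributes 0 or 1 and the contributions
// are summed, so the order of evaluation does not affect the result.
template <typename DType>
std::size_t CountValidLabels(const std::vector<DType>& label, int ignore_label) {
  std::size_t valid_cnt = std::accumulate(
      label.begin(), label.end(), std::size_t{0},
      [ignore_label](std::size_t acc, DType v) {
        // Map step: 1 for a valid label, 0 for an ignored one.
        return acc + (static_cast<int>(v) != ignore_label ? 1 : 0);
      });
  // Same guard as in the operator: never return 0 as a divisor.
  return valid_cnt == 0 ? 1 : valid_cnt;
}
```

With labels {0, 1, -1, 2, -1} and ignore_label = -1, three labels are valid. If every label is ignored, the guard returns 1 rather than 0.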