I found this implementation somewhat contradictory to the text description.

  1. Should we create num_heads copies of W_q, W_k, W_v? Or, equivalently, pass hidden_size*num_heads as the first argument to the dense layers W_q, W_k, W_v?
  2. Should we have num_heads copies of the attention layer instead of one?

Yeah. I also think the implementation for multi-head attention is wrong here and what you suggest is correct.

I think the implementation has the same effect as the text description.

1. For W_q, W_k, W_v, the implementation here uses a single dense layer with hidden_size*num_heads units, while the paper uses num_heads copies of a dense layer with hidden_size units. I think the ith copy of the small dense layer is equivalent to units [(i-1)*hidden_size : i*hidden_size] of the big dense layer. Backpropagation during training also has the same effect whether you use one big layer or num_heads copies of the small layer, because each output unit depends only on its own weights.
2. For the attention layers, the transformer uses dot-product attention, which has no trainable parameters; it only computes matrix multiplications and dot products of its inputs. Using one attention layer or several makes no difference.
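
To make both points concrete, here is a minimal NumPy sketch. It is not the implementation under discussion: the shapes, the random weights, and the helper names (`project_and_split`, `dot_product_attention`) are all assumptions chosen for illustration. It checks that slicing one big projection gives the same result as separate per-head projections, and that running parameter-free dot-product attention once over all heads matches running it once per head.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, d_model = 2, 5, 8      # illustrative sizes (assumed)
num_heads, hidden_size = 4, 3          # hidden_size is the per-head width

X = rng.normal(size=(batch, seq_len, d_model))

# One big weight matrix per projection, with hidden_size * num_heads units.
W_q, W_k, W_v = (rng.normal(size=(d_model, num_heads * hidden_size))
                 for _ in range(3))


def project_and_split(x, w):
    """Apply the big dense layer, then split the units into heads.

    Returns an array of shape (batch, num_heads, seq_len, hidden_size).
    """
    out = x @ w
    out = out.reshape(x.shape[0], x.shape[1], num_heads, hidden_size)
    return out.transpose(0, 2, 1, 3)


def dot_product_attention(q, k, v):
    """Scaled dot-product attention; note that it has no weights of its own."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


# Point 1: head i (0-indexed here) is columns [i*hidden_size:(i+1)*hidden_size]
# of the big layer, so num_heads small layers give the same projections.
small_layers = [W_q[:, i * hidden_size:(i + 1) * hidden_size]
                for i in range(num_heads)]
per_head_q = np.stack([X @ w for w in small_layers], axis=1)
q = project_and_split(X, W_q)
projections_match = np.allclose(q, per_head_q)

# Point 2: one attention call over all heads vs. one call per head.
k = project_and_split(X, W_k)
v = project_and_split(X, W_v)
batched = dot_product_attention(q, k, v)
looped = np.stack([dot_product_attention(q[:, i], k[:, i], v[:, i])
                   for i in range(num_heads)], axis=1)
heads_match = np.allclose(batched, looped)

print(projections_match, heads_match)
```

The sketch only covers the forward pass. The gradient argument in point 1 follows from the same column structure: the loss gradient for a block of columns in the big matrix involves only that head's outputs.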

