Can the course staff please clarify what this question means or point to some related resources/examples? Thank you!

We use part of the original string to predict what comes next. For example, after `"But B"` comes `"r"`, and after `"ut Br"` comes `"u"`. Here we are preparing training data so the model can learn what comes after a given run of characters. The above is a 5-gram setup: each X is a sliding window of 5 characters.
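In case it helps, here is a minimal sketch of how those (X, y) pairs could be built. The function name `make_ngram_pairs` and the sample string `"But Brutus"` are my own assumptions for illustration (the string is just something that starts the same way as the example), not anything from the assignment.

```python
def make_ngram_pairs(text, n=5):
    """Return (context, next_char) pairs from a sliding window of n characters."""
    # Window i covers text[i:i+n]; the label is the character right after it.
    return [(text[i:i + n], text[i + n]) for i in range(len(text) - n)]


# Hypothetical sample text, chosen only to match the "But B" -> "r" example.
pairs = make_ngram_pairs("But Brutus", n=5)
for context, target in pairs:
    print(repr(context), "->", repr(target))
```

Each window shifts by one character, so a string of length L gives L - 5 training examples.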

Regarding sequential encoding, do we always have a sequence of 5?

From my understanding, for " * Use a bag of characters encoding that sums over all occurrences.", we simply turn the 5 characters into a single vector by summing their one-hot encodings. So for sequential encoding, can we just keep it as a matrix of shape (vocab size, 5) as the input, where the ith column is the one-hot vector of the ith character?

Edit: It seems to work

That's an interesting approach, though I'm not sure how well it works. My understanding is that we sum the encodings of the 5 characters into a vector rather than keeping a matrix.

Yeah, 5-gram is fine.

I thought the question asked us to use two models.

- In one case use a sequential encoding to obtain an embedding proportional to the length of the sequence. (each example is a matrix)
- Use a bag of characters encoding that sums over all occurrences. (each example is a vector)
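A rough sketch of the two encodings, in plain Python so the shapes are easy to see. The helper names and the tiny vocabulary built from the context itself are assumptions for illustration; in practice the vocabulary would come from the whole corpus. Note the sequential matrix here is laid out as (5, vocab size), i.e. the transpose of the (vocab size, 5) layout mentioned above, and flattening it before feeding an MLP is one possible choice, not something the question prescribes.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1 at position index."""
    vec = [0] * size
    vec[index] = 1
    return vec


def sequential_encoding(context, vocab):
    """One one-hot row per position: len(context) x len(vocab), order preserved."""
    return [one_hot(vocab[ch], len(vocab)) for ch in context]


def bag_encoding(context, vocab):
    """Sum of the one-hot vectors: a single count vector of length len(vocab)."""
    counts = [0] * len(vocab)
    for ch in context:
        counts[vocab[ch]] += 1
    return counts


context = "But B"
# Toy vocabulary built from the context only (assumption for the example).
vocab = {ch: i for i, ch in enumerate(sorted(set(context)))}

seq = sequential_encoding(context, vocab)   # matrix: 5 rows, one per character
bag = bag_encoding(context, vocab)          # vector: character counts
flat = [x for row in seq for x in row]      # flattened matrix as MLP input

print(len(seq), len(seq[0]), len(flat), bag)
```

The sequential input grows with the window length (5 x vocab size here), while the bag vector always has vocab size entries.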

And the result is consistent with our intuition about which one should work significantly better than the other.

That sounds correct to me!

If that's the case, we would lose sequential information when we turn the matrix into a vector. E.g. "aab" would have the same summed encoding as "baa" and "aba".

Yep, the second case will lead to this kind of information loss.
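A quick way to check this, as a small sketch: `collections.Counter` stands in for the summed bag-of-characters vector, and a plain list of characters stands in for the position-by-position encoding. Both stand-ins are simplifications I am assuming for the demo, not the actual model inputs.

```python
from collections import Counter


def bag_of_chars(s):
    """Character counts only; position is discarded."""
    return Counter(s)


def sequential(s):
    """Characters in order; position is kept."""
    return list(s)


words = ["aab", "baa", "aba"]

# All three collapse to the same bag: two 'a' and one 'b'.
same_bag = all(bag_of_chars(w) == bag_of_chars(words[0]) for w in words)

# The sequential form keeps them distinct.
distinct_seq = len({tuple(sequential(w)) for w in words}) == len(words)

print(same_bag, distinct_seq)
```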

Which one should we use as the feature matrix when training the MLP?

The question mentioned both cases.

Yes, train using both.