# HW8.3 Clarification

Can the course staff please clarify what this question means or point to some related resources/examples? Thank you!

We use part of the original string to predict what comes next, e.g. after `"But B"` comes `"r"`, and after `"ut Br"` comes `"u"`. We are preparing training data here so the model can learn what comes after certain characters. The above is a 5-gram, which takes a 5-character sliding window for each X.
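The sliding-window construction described above can be sketched as follows; the function name and example string are illustrative, not the assignment's exact API:

```python
def make_ngram_pairs(text, n=5):
    """Slide a window of n characters over the text; each window is an
    input X and the character immediately after it is the label y."""
    pairs = []
    for i in range(len(text) - n):
        pairs.append((text[i:i + n], text[i + n]))
    return pairs

pairs = make_ngram_pairs("But Brutus", n=5)
# First pair: ("But B", "r"); second pair: ("ut Br", "u")
```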

Regarding sequential encoding, do we always have a sequence of 5?
From my understanding, for "* Use a bag of characters encoding that sums over all occurrences.", we simply turn the 5 characters into one-hot vectors and sum them into a single vector. So for sequential encoding, can we just retain a matrix of shape (vocab size, 5) as the input, where the ith column is the one-hot vector of the ith character?

Edit: It seems to work
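A minimal sketch of the sequential encoding described above, assuming a small illustrative vocabulary (the helper names are not from the assignment):

```python
import numpy as np

# Illustrative vocabulary built from a toy string.
vocab = sorted(set("But Brutus"))
char_to_idx = {c: i for i, c in enumerate(vocab)}

def sequential_encode(window, vocab_size, char_to_idx):
    """Return a (vocab_size, len(window)) matrix whose i-th column is
    the one-hot vector of the i-th character in the window."""
    mat = np.zeros((vocab_size, len(window)))
    for col, ch in enumerate(window):
        mat[char_to_idx[ch], col] = 1.0
    return mat

X = sequential_encode("But B", len(vocab), char_to_idx)
# X has shape (vocab_size, 5); each column sums to 1.
```

Each example is a matrix, so column order preserves the position of each character in the window.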

That's an interesting approach; I'm not sure how well it works. My understanding is to sum the 5-character encodings into a vector rather than a matrix.

Yeah, 5-gram is fine.

I thought the question asked us to use two models:

1. In one case use a sequential encoding to obtain an embedding proportional to the length of the sequence. (each example is a matrix)
2. Use a bag of characters encoding that sums over all occurrences. (each example is a vector)
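The second (bag-of-characters) case can be sketched like this, again with an illustrative vocabulary; it is just the column-wise sum of the sequential matrix:

```python
import numpy as np

# Illustrative vocabulary built from a toy string.
vocab = sorted(set("But Brutus"))
char_to_idx = {c: i for i, c in enumerate(vocab)}

def bag_encode(window, vocab_size, char_to_idx):
    """Return a length-vocab_size vector counting each character's
    occurrences in the window (summing over all occurrences)."""
    vec = np.zeros(vocab_size)
    for ch in window:
        vec[char_to_idx[ch]] += 1.0
    return vec

v = bag_encode("ut Br", len(vocab), char_to_idx)
# v.sum() == 5: one count per character in the 5-character window.
```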

And the result is consistent with our intuition: one should work significantly better than the other.

That sounds correct to me!

If that's the case, we would lose sequential information when we turn the matrix into a vector. E.g. "aab" would have the same encoding as "baa" and "aba".
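A quick check of this information loss (a toy sketch, with a hypothetical two-character vocabulary): all permutations of the same characters collapse to one bag-of-characters vector.

```python
import numpy as np

def bag(window, vocab):
    """Count occurrences of each vocabulary character in the window."""
    vec = np.zeros(len(vocab))
    for ch in window:
        vec[vocab.index(ch)] += 1.0
    return vec

vocab = ["a", "b"]
# "aab", "aba", and "baa" all map to the same count vector [2., 1.].
same = (bag("aab", vocab) == bag("aba", vocab)).all() \
    and (bag("aab", vocab) == bag("baa", vocab)).all()
```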

Yep, the second case will lead to this kind of information loss.

Which one should we use as the feature matrix when training the MLP?

The question mentioned both cases.

Yes, train using both.