How to make LSTM handle images of different sizes?


I implemented a CNN-LSTM model for text recognition in images. I extract image features with a CNN and feed the extracted features to an LSTM layer. When I trained the model with images of the same size (128, 1600), it did well. But when I tried to train the model with images of different sizes, I got the following error:

AssertionError: Expected shape (800, 4000) is incompatible with given shape (800, 16384).

I am getting this error at the LSTM. With an image of size (128, 1600), the shape of the CNN output is (Batch_size, 32, 64, 800). I flattened this, which gives (Batch_size, 1638400), and made 100 (sequence_length) splits along axis 1. The resulting ndarray of shape (100, Batch_size, 16384) is sent to the LSTM.

Since the LSTM weights are initialized on the first forward pass, when the first image is of size (128, 1600) the weights get initialized with shape (800, 16384); when I then give an image of a different size, I get the above error.
Here 800 is: 2 (bidirectional) * 2 (number of LSTM layers) * 200 (LSTM hidden units)
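For reference, the flatten-and-split described above can be sketched with NumPy (the batch size of 2 is just a placeholder):

```python
import numpy as np

N = 2                                  # hypothetical small batch size
feat = np.zeros((N, 32, 64, 800))      # CNN output for a (128, 1600) image: (N, C, H, W)
flat = feat.reshape(N, -1)             # (N, 32*64*800) = (N, 1638400)
steps = np.split(flat, 100, axis=1)    # 100 chunks, each (N, 16384)
seq = np.stack(steps, axis=0)          # (100, N, 16384), the LSTM input
```

A different image width changes the 16384 per-step feature size, which is what collides with the already-initialized LSTM weights.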

How can I resolve this issue and make the LSTM handle images of different sizes?

Any suggestions will be helpful.

Thanks in advance,

I believe you’re using Gluon. When you create your gluon.rnn.LSTM layer, do you specify the layout as 'TNC'?

Hi @safrooze,

Thanks for replying…

I didn’t specify the layout in the LSTM layer, but I think the default is ‘TNC’ according to the Gluon LSTM docs, and I am giving the input to the LSTM in the same (‘TNC’) format.


@harathi You’re correct. I just read your question in detail and what you’re trying to do is invalid. Your network must have a fixed weight size. A few potential solutions:

  • Pad smaller images to a fixed large size
  • Scale images to a fixed size
  • Max-pool or avg-pool the feature vectors of each sequence element into a single element.
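The first option can be sketched like this (the fixed canvas size is an assumption; each image would be padded before going through the CNN):

```python
import numpy as np

H_MAX, W_MAX = 128, 1600               # hypothetical fixed canvas size
img = np.zeros((30, 175))              # a smaller input image
pad_h = H_MAX - img.shape[0]
pad_w = W_MAX - img.shape[1]
padded = np.pad(img, ((0, pad_h), (0, pad_w)), mode='constant')  # pad bottom/right with zeros
```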

Also, looking at how you split your array, I believe you won’t get what you expect. I believe you’re trying to split the array so that each element of the sequence is 8 columns of the image (i.e. 32x64x8). If that’s what you want, the correct thing to do is to split on the last axis and then flatten the remaining axes. Alternatively, you can transpose from (N,C,H,W) to (N,W,H,C) and then flatten/split the same way you do right now.
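The split-then-flatten order described above could look like this in NumPy (same hypothetical shapes as in the question):

```python
import numpy as np

N, C, H, W, SEQ_LEN = 2, 32, 64, 800, 100
feat = np.zeros((N, C, H, W))                             # CNN output
cols = np.split(feat, SEQ_LEN, axis=3)                    # 100 slices, each (N, 32, 64, 8)
seq = np.stack([c.reshape(N, -1) for c in cols], axis=0)  # (100, N, 32*64*8)
# each sequence element now holds 8 contiguous image columns,
# instead of an arbitrary chunk of the fully flattened tensor
```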

Thanks @safrooze,

That means I need to split the array along the width of the image before flattening it. Am I understanding correctly?

If I have some images as small as (30, 175) and some as large as (150, 1500), will padding the smaller images to the larger size affect how the model learns on the smaller images?

If you don’t mind, can you please explain this point…


  1. Yes, split along the width before flattening.
  2. Learning is not impacted. Your network has to have the capacity to learn different feature sizes. Just make sure the image sizes you see during training are representative of the image sizes presented during inference.
  3. If each sequence element is, say, (batch x 32 x 64 x 8), you can flatten it to (batch x 32 x 512) and apply MaxPool1D or AvgPool1D to get a single (batch x 32) vector for that sample that is independent of the image dimension.
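Point 3 amounts to taking the max (or mean) over the flattened spatial axis; a NumPy sketch with the same hypothetical shapes:

```python
import numpy as np

batch = 2
elem = np.random.rand(batch, 32, 512)  # one sequence element, flattened to (batch, channels, width)
pooled = elem.max(axis=2)              # max over all 512 positions, one value per channel
# pooled has shape (batch, 32) regardless of the original image width
```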

@safrooze, thanks a lot

I will try this and let you know if I get any errors.


Hi @safrooze,

To get a (batch x 32) vector from (batch x 32 x 512), we need to set the stride to the width (here 512) in MaxPool1D. Please correct me if I am wrong…


No, you’d want to set pool_size to 512 so that it would find the maximum of each channel within the 512 values.
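In window terms: with pool_size=512 over a width-512 input, there is exactly one window covering the whole width, so the output width is 1. A NumPy emulation of that windowed max (in Gluon’s MaxPool1D, strides defaults to pool_size, so no extra stride argument is needed):

```python
import numpy as np

x = np.random.rand(2, 32, 512)          # (batch, channels, width)
pool_size = 512                         # one window spans the full width
n_win = x.shape[2] // pool_size         # = 1 window
out = x.reshape(2, 32, n_win, pool_size).max(axis=3)  # (2, 32, 1)
```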

Oh ok, got it…
Thanks @safrooze

Hi @safrooze,

Can I do the same with mx.nd.Pooling() instead of mx.gluon.nn.MaxPool1D?
I did it as follows; x is the convolution output with shape (BATCH_SIZE, Channels, H1, W1):

    seqs = x.split(num_outputs=SEQ_LEN, axis=3)  # list of SEQ_LEN arrays, each (N, CHANNELS, H1, W1/SEQ_LEN)
    pooled_seqs = []
    for seq in seqs:
        # flatten H and W into one axis: (N, CHANNELS, H1 * W1/SEQ_LEN)
        seq = seq.reshape((seq.shape[0], seq.shape[1], seq.shape[2] * seq.shape[3]))
        # max-pool over the whole last axis: (N, CHANNELS, 1)
        pooled_seqs.append(mx.nd.Pooling(seq, kernel=(seq.shape[2],), pool_type='max'))
    x = nd.concat(*[elem.expand_dims(axis=0) for elem in pooled_seqs], dim=0)
    x = x.reshape((x.shape[0], x.shape[1], x.shape[2]))  # (SEQ_LEN, BATCH_SIZE, CHANNELS)
    x = self.lstm(x)

I have a doubt here: originally I have 512 * 32 = 16384 features for every sequence element, and I am reducing that to 32 features. Won’t that impact model performance?

Sorry, I am asking too many questions.