Need help with flickr8 dataset tokenizing words

hskramer · April 30, 2019, 2:21am

I have been working on this for a couple of weeks now. I have accomplished a lot, but I'm stuck on how to process/tokenize the words. I have been following the Word Embedding tutorial: my vocab is good and idx_to_counts works, but I'm stuck on tokenizing. The most common error is: 'list' object has no attribute 'transform'.

NRauschmayr · April 30, 2019, 4:57pm

Can you provide a small reproducible example? Do you use a tokenizer provided by GluonNLP https://gluon-nlp.mxnet.io/api/data.html#transforms ?

hskramer · April 30, 2019, 9:30pm

This is straight out of the word embeddings tutorial; the counter and vocab work fine. I'm trying to figure out how to transform the sentences into tokens. I was also thinking of trying the SpacyTokenizer. This is my first attempt at NLP. I'm comfortable with images and convolution-based nets, but NLP is much more difficult. My long-term goal is to reproduce "One neural network, many uses": https://towardsdatascience.com/one-neural-network-many-uses-image-captioning-image-search-similar-image-and-words-in-one-model-1e22080ce73d

counter = nlp.data.count_tokens(itertools.chain.from_iterable(descriptions))
vocab = nlp.Vocab(counter, unknown_token=None, padding_token=None, bos_token='', eos_token='', min_freq=5)

This returns almost exactly the same number of words as the arXiv articles I've read report.

idx_to_counts = [counter[w] for w in vocab.idx_to_token]
def code(sentence):
    return [vocab[token] for token in lines if token in vocab]

flickr8 = lines.transform(code, lazy=False)
This fails with the list error above. I have close to 40 functions that convert the text in many different ways, and I'm stuck on which to use and in what order.

NRauschmayr · May 1, 2019, 4:36am

The reason you get the error is that lines is a list of tokens (a plain Python list), which has no transform method. You have to call transform on an nlp.data dataset object.
The following should work:

text8 = nlp.data.Text8()
def code(sentence):
   return [vocab[token] for token in sentence if token in vocab]
text8 = text8.transform(code, lazy=False)
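
For captions already held in a plain Python list, the same per-sentence function can be applied with a list comprehension instead of transform. The following is only a minimal sketch: `descriptions` and the dict standing in for `vocab` are hypothetical toy data, not the Flickr8 captions or a real nlp.Vocab. Wrapping the list in `mxnet.gluon.data.SimpleDataset` should also make `.transform` available, but that route is not shown here.

```python
# Hypothetical stand-ins: `descriptions` is a list of already-tokenized captions,
# and `vocab` is a plain dict playing the role of the token-to-index lookup.
descriptions = [
    ["a", "dog", "runs", "on", "grass"],
    ["a", "child", "runs", "on", "sand"],
]
vocab = {"a": 0, "dog": 1, "runs": 2, "on": 3}


def code(sentence):
    # Keep only in-vocabulary tokens and map each one to its integer index.
    return [vocab[token] for token in sentence if token in vocab]


# A plain list has no .transform method, so apply the function per caption.
encoded = [code(sentence) for sentence in descriptions]
print(encoded)  # [[0, 1, 2, 3], [0, 2, 3]]
```

Each caption becomes its own list of indices, and out-of-vocabulary tokens such as "grass" are silently dropped.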

hskramer · May 2, 2019, 12:25am

There are no errors, but this doesn't tokenize the Flickr8 dataset; it returns a long string of mostly repeating numbers. I'm so close: I can retrieve an image/caption pair, normalize the image, and pass it through a dense layer. I'm going to use the merge method; I just need a way to tokenize the Flickr8 captions. If the above doesn't work, can you point me in the right direction?
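
Note that the suggested snippet loads Text8 rather than the Flickr8 captions, so its output is not caption data. One possible direction for raw caption strings is sketched below, using only the Python standard library rather than GluonNLP. The sample captions, the `tokenize` helper, the regex rule, and `MIN_FREQ = 2` are all assumptions for illustration; the thread itself uses min_freq=5 on the full dataset, and a tokenizer such as the SpacyTokenizer mentioned earlier could replace the regex.

```python
import re
from collections import Counter

# Hypothetical sample captions; the real ones come from the Flickr8 caption file.
captions = [
    "A dog runs on the grass.",
    "A black dog runs!",
    "Two children play on the sand.",
]


def tokenize(caption):
    # Lowercase, then keep runs of letters so punctuation and digits are dropped.
    return re.findall(r"[a-z]+", caption.lower())


tokenized = [tokenize(caption) for caption in captions]

# Count tokens across all captions and keep those seen at least MIN_FREQ times.
MIN_FREQ = 2  # illustrative threshold for this tiny sample
counter = Counter(token for sentence in tokenized for token in sentence)
kept = sorted(token for token, count in counter.items() if count >= MIN_FREQ)
token_to_idx = {token: idx for idx, token in enumerate(kept)}

print(tokenized[0])   # ['a', 'dog', 'runs', 'on', 'the', 'grass']
print(token_to_idx)   # {'a': 0, 'dog': 1, 'on': 2, 'runs': 3, 'the': 4}
```

The resulting list of token lists has the same shape as the input that count_tokens receives in the earlier post via itertools.chain.from_iterable.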
