I use FastText from the gensim package, and I use the code below to transform my text into a dense representation, but it takes a long time when I have a huge dataset. Could you help me accelerate it?
```python
import numpy as np

def word2vec_features(self, templates, model):
    if self.method == 'mean':
        feats = np.vstack([sum_vectors(p, model) / len(p) for p in templates])
    else:
        feats = np.vstack([sum_vectors(p, model) for p in templates])
    return feats

def get_vect(word, model):
    try:
        return model.wv[word]
    except KeyError:
        # gensim models expose the dimensionality as `vector_size`
        return np.zeros((model.vector_size,))

def sum_vectors(phrase, model):
    return sum(get_vect(w, model) for w in phrase)
```
Note that this sort of summary-vector for a text – the average (or sum) of all its word-vectors – is fairly crude. It can work OK as a baseline in some contexts – such as fuzzy info-retrieval among short texts, or as input features to a classifier.
In some cases, if the KeyError is hit often, that exception handling can be expensive, and it may make sense to instead check whether a key is in the collection. Also, you may not want to use an origin vector (all zeros) for a missing word: it likely offers no benefit over simply skipping that word. So you might get some speedup by changing your code to ignore missing words, rather than adding an all-zeros vector in an exception handler.
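For example, a membership-check version of sum_vectors() that skips missing words might look like this (a sketch; the zero-vector fallback for a phrase with no known words is my assumption, and `model.wv` is assumed to support `in` and per-word lookup as gensim's KeyedVectors does):

```python
import numpy as np

def sum_vectors(phrase, model):
    # Cheap membership test instead of raising/catching KeyError per word;
    # out-of-vocabulary words are simply skipped.
    known = [w for w in phrase if w in model.wv]
    if not known:
        # Assumed fallback: a zero vector when every word is missing.
        return np.zeros(model.wv.vector_size)
    return np.sum([model.wv[w] for w in known], axis=0)
```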
But also: if you're truly using a FastText model (rather than, say, Word2Vec), it will never raise a KeyError for an unknown word, because it always synthesizes a vector out of the character n-grams (word fragments) it learned during training. You should probably just drop your get_vect() function entirely, relying on normal []-access.
Further, Gensim's KeyedVectors models already support returning multiple results when indexed by a list of keys. And numpy's np.sum() may work slightly faster on these arrays than the pure-Python sum(). So you might get a small speedup by replacing your sum_vectors() with:
```python
def sum_vectors(phrase, model):
    return np.sum(model.wv[phrase], axis=0)
```
To optimize further, you might need to profile the code in a heavy-usage loop, or even reconsider whether this is the form of text-vectorization you want to pursue. (Though, better methods typically require more calculation than this simple sum/average.)
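As a starting point for that profiling, here is a sketch comparing the two summing strategies. ToyKV is a hypothetical stand-in for a KeyedVectors-like lookup table (not gensim itself), so the timings only indicate the relative cost of the Python-level loops:

```python
import timeit
import numpy as np

# Hypothetical stand-in for gensim KeyedVectors: supports single-key
# and list-of-keys lookup, which is all the comparison needs.
class ToyKV:
    def __init__(self, vocab, dim=100):
        self._vecs = {w: np.random.rand(dim) for w in vocab}
    def __contains__(self, w):
        return w in self._vecs
    def __getitem__(self, key):
        if isinstance(key, list):
            return np.array([self._vecs[w] for w in key])
        return self._vecs[key]

vocab = [f"w{i}" for i in range(1000)]
kv = ToyKV(vocab)
phrase = vocab[:50]

py_sum = lambda: sum(kv[w] for w in phrase)       # pure-Python sum, word by word
np_sum = lambda: np.sum(kv[phrase], axis=0)       # one batch lookup + np.sum

print("pure-Python sum:", timeit.timeit(py_sum, number=2000))
print("np.sum on batch:", timeit.timeit(np_sum, number=2000))
```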