
How can I optimize my Embedding transformation on a huge dataset?

I use FastText from the gensim package, and I use the code below to transform my text into a dense representation, but it takes a long time when I have a huge dataset. Could you help me to accelerate it?

import numpy as np


def word2vec_features(self, templates, model):
    # Stack one summary vector per template: the mean of its word-vectors
    # when method == 'mean', otherwise the plain sum.
    if self.method == 'mean':
        feats = np.vstack([sum_vectors(p, model) / len(p) for p in templates])
    else:
        feats = np.vstack([sum_vectors(p, model) for p in templates])
    return feats


def get_vect(word, model):
    # Look up one word-vector, falling back to an all-zeros vector for
    # out-of-vocabulary words.
    try:
        return model.wv[word]
    except KeyError:
        # model.vector_size (not model.size) holds the dimensionality in gensim
        return np.zeros((model.vector_size,))


def sum_vectors(phrase, model):
    # Element-wise sum of the vectors of every word in the phrase.
    return sum(get_vect(w, model) for w in phrase)

Note that this sort of summary-vector for a text – the average (or sum) of all its word-vectors – is fairly crude. It can work OK as a baseline in some contexts, such as fuzzy info-retrieval among short texts, or as a classifier input.

In some cases, if the KeyError is hit often, that exception-handling can be expensive - and it may make sense to instead check whether a key is in the collection. But also, you may not want to be using an origin-vector (all zeros) for any missing word - it likely offers no benefit over just skipping those words.

So you might get some speedup by changing your code to ignore missing words, rather than adding an all-zeros vector in an exception handler.
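For example, here's a minimal sketch of that change, assuming an ordinary Word2Vec-style KeyedVectors where the in test checks full-word vocabulary membership:

import numpy as np

def sum_vectors(phrase, model):
    # Skip out-of-vocabulary words instead of adding zero vectors; the
    # membership test avoids the cost of raising and catching KeyError.
    # The explicit zero start value keeps the result an array even if
    # every word in the phrase is missing.
    start = np.zeros(model.wv.vector_size)
    return sum((model.wv[w] for w in phrase if w in model.wv), start)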

But also: if you're truly using a FastText model (rather than, say, Word2Vec), it will never raise KeyError for an unknown word, because it will always synthesize a vector out of the character n-grams (word fragments) it learned during training. You should probably just drop your get_vect() function entirely, relying just on normal []-access.
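As a small illustration, here's a toy example (hypothetical corpus and parameters, assuming the gensim 4.x API) showing that a FastText model returns a vector even for a word it never saw:

from gensim.models import FastText

# Tiny made-up corpus, purely for illustration.
sentences = [["hello", "world"], ["hello", "there"]]
model = FastText(vector_size=32, window=3, min_count=1)
model.build_vocab(corpus_iterable=sentences)
model.train(corpus_iterable=sentences, total_examples=model.corpus_count, epochs=5)

# A misspelled word never seen in training: no KeyError; a vector is
# synthesized from its character n-grams.
vec = model.wv["helo"]
print(vec.shape)  # (32,)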

Further, Gensim's KeyedVectors models already support returning multiple results when indexed by a list of multiple keys. And the numpy np.sum() might work a bit faster on these arrays than the pure-Python sum(). So you might get a small speedup if you replace your sum_vectors() with:

def sum_vectors(phrase, model):
    # Indexing wv with a list of words returns a 2-D array of vectors;
    # np.sum then adds them along the word axis in one vectorized call.
    return np.sum(model.wv[phrase], axis=0)
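If you use the 'mean' method, the same vectorized lookup works with np.mean - a hypothetical companion helper, not part of the original code:

import numpy as np

def mean_vectors(phrase, model):
    # Average instead of sum, along the word axis.
    return np.mean(model.wv[phrase], axis=0)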

To optimize further, you might need to profile the code in a heavy-usage loop, or even reconsider whether this is the form of text-vectorization you want to pursue. (Though better methods typically require more calculation than this simple sum/average.)
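As a starting point for profiling, a quick harness with the standard-library cProfile might look like this (featurizer, templates, and model are hypothetical stand-ins for your own objects):

import cProfile
import pstats

with cProfile.Profile() as pr:  # context-manager form requires Python 3.8+
    feats = featurizer.word2vec_features(templates, model)

# Show the ten most expensive calls by cumulative time.
pstats.Stats(pr).sort_stats("cumulative").print_stats(10)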
