
How can I convert a dataset to GloVe or word2vec format?

Original post: 2020-12-11 16:24:54. Tags: python, nlp, stanford-nlp, word2vec

I have downloaded my Twitter archive and want to run word2vec on it to experiment with most-similar words, analogies, and so on.

But I am stuck at the first step: how do I convert a given dataset, CSV, or document so that it can be fed to word2vec? That is, what is the process for converting data to GloVe/word2vec format?

1 answer

Implementations of the word2vec and GloVe algorithms typically do one or both of the following:

accept a plain text file in which tokens are delimited by one or more spaces and each newline-delimited line is treated as one text (lines should not be "too long"; usually one short article, paragraph, or sentence per line)
offer a language- or library-specific interface for feeding texts (lists of tokens) to the algorithm as a stream or iterable
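
As a rough illustration of both input shapes described above, the sketch below turns CSV rows into lists of tokens and then writes them out as one space-delimited text per line. It uses only the Python standard library. The column name "text", the sample rows, the helper names, and the very simple tokenizer are all assumptions made for this example; a real Twitter archive may use a different layout and would likely need more careful tokenization.

```python
import csv
import io
import re


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters, digits, and _ # @ '."""
    return re.findall(r"[a-z0-9_#@']+", text.lower())


def csv_to_token_lists(csv_file, text_column="text"):
    """Yield one list of tokens per CSV row, skipping rows that yield no tokens.

    `text_column` is a hypothetical column name; adjust it to the real file.
    """
    for row in csv.DictReader(csv_file):
        tokens = tokenize(row.get(text_column) or "")
        if tokens:
            yield tokens


def write_line_per_text(token_lists, out_file):
    """Write each text on its own line, with tokens separated by single spaces."""
    for tokens in token_lists:
        out_file.write(" ".join(tokens) + "\n")


# Made-up sample data standing in for a real CSV export.
# The second row contains a line break inside the quoted field.
sample_csv = io.StringIO(
    'tweet_id,text\n'
    '1,"Hello world, hello #NLP!"\n'
    '2,"word2vec needs\nplain tokens"\n'
)

token_lists = list(csv_to_token_lists(sample_csv))  # the "iterable of token lists" shape

corpus = io.StringIO()  # stands in for a file opened for writing
write_line_per_text(token_lists, corpus)  # the "plain text file" shape
print(corpus.getvalue())
```

Because tokenizing discards line breaks inside a tweet, each text ends up on exactly one line, which is what the plain text format expects.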

The Python Gensim library offers both options for its Word2Vec class.

Before applying such libraries to your own data, you should generally work through one or more tutorials to get a practical overview of the steps involved, from raw data to interesting results. By examining the formats those tutorials use, and the extra steps they perform to massage the data into the formats needed by the exact libraries you are using, you will also get ideas for how your own data needs to be prepared.