
How can I convert a dataset to GloVe or word2vec format?

I downloaded my Twitter archive and want to run word2vec on it to experiment with most-similar words, analogies, etc.

But I am stuck at the first step: how do I convert a given dataset / CSV / document so that it can be input to word2vec? I.e., what is the process for converting data to GloVe/word2vec format?

Typically, implementations of the word2vec & GloVe algorithms do one or both of the following:

  • accept a plain text file in which tokens are delimited by one or more spaces and each newline-delimited line is treated as one text (lines shouldn't be "too long"; usually one short article, paragraph, or sentence per line)

  • have some language/library-specific interface for feeding texts (lists of tokens) to the algorithm as a stream/iterable
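As a rough illustration of both input styles, here is a minimal Python sketch that uses only the standard library. It assumes your archive is a CSV with a header row and that the tweet text lives in a column named `full_text`. That column name, the file paths, and the very simple regex tokenizer are all assumptions for illustration, so adjust them to your actual export.

```python
import csv
import re

# Assumption: a crude tokenizer that keeps letters, digits, and a few
# Twitter-relevant characters (#, @, _, '). Real projects often use
# something more careful.
TOKEN_RE = re.compile(r"[a-z0-9_#@']+")


def tokenize(text):
    """Lowercase the text and split it into simple word-like tokens."""
    return TOKEN_RE.findall(text.lower())


def csv_to_line_per_text(csv_path, txt_path, text_column="full_text"):
    """Option 1: write each text as one line of space-delimited tokens."""
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(txt_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            tokens = tokenize(row[text_column])
            if tokens:  # skip rows with no usable tokens
                dst.write(" ".join(tokens) + "\n")
                written += 1
    return written


class CsvCorpus:
    """Option 2: a restartable iterable yielding one token list per row."""

    def __init__(self, csv_path, text_column="full_text"):
        self.csv_path = csv_path
        self.text_column = text_column

    def __iter__(self):
        # Reopening the file on every pass lets a training library
        # iterate over the corpus more than once.
        with open(self.csv_path, newline="", encoding="utf-8") as src:
            for row in csv.DictReader(src):
                tokens = tokenize(row[self.text_column])
                if tokens:
                    yield tokens
```

The class is used instead of a plain generator because many training libraries make several passes over the data, and a generator can only be consumed once.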

The Python Gensim library offers both options for its Word2Vec class.

Before applying such libraries to your own data, you should generally work through one or more tutorials to get a working overview of the steps involved, from raw data to interesting results. By examining the formats those tutorials use, and the extra steps they perform to massage the data into the formats needed by the exact libraries you're using, you'll also get ideas for how your own data needs to be prepared.
