Read data from memory in Vowpal Wabbit?

Is there a way to send data to train a model in Vowpal Wabbit without writing it to disk?

Here's what I'm trying to do. I have a relatively large dataset in CSV (around 2 GB) which fits in memory with no problem. I load it in R into a data frame, and I have a function to convert the data in that data frame into VW format.

Now, in order to train a model, I have to write the converted data to a file first, and then feed that file to VW. The writing-to-disk part takes way too long, especially since I want to try various models with different feature transformations, and thus I have to write the data to disk multiple times.

So, assuming I'm able to create a character vector in R, in which each element is a row of data in VW format, how could I feed that into VW without writing it to disk?

I considered using daemon mode and writing the character vector to a localhost connection, but I couldn't get VW to train in daemon mode, and I'm not sure this is even possible.

I'm willing to use C++ (through the Rcpp package) if necessary to make this work.

Thank you very much in advance.

UPDATE:

Thank you everyone for your help. In case anyone's interested, I just piped the output to VW as suggested in the answer, like so:

# Two sample rows of data
datarows <- c("1 |name 1:1 2:4 4:1", "-1 |name 1:1 4:1")
# Open connection to VW
con <- pipe("vw -f my_model.vw")
# Write to connection and close
writeLines(datarows, con)
close(con)

What you may be looking for is running vw in daemon mode.

The standard way to do this is to start vw as a daemon:

vw -i some.model --daemon --quiet --port 26542 -p /dev/stdout

You may replace 26542 with the port of your choice.

Now you can open a TCP connection to the server (which can be localhost, on port 26542), and every request you write to the TCP socket will be answered on the same socket.

You can both learn (send labeled examples, which will change the model in real time) and send queries and read back responses.

You can do it either one query+prediction at a time or many at a time. All you need is a newline character at the end of each query, exactly as you would when testing from a file. Order is guaranteed to be preserved.
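The request/response exchange described above could be sketched as follows. This is a minimal illustration in Python rather than R, assuming a vw daemon is already listening on localhost:26542 as in the command above and that it answers each example with exactly one line. The function name `query_daemon` and its parameters are hypothetical, not part of VW.

```python
import socket


def query_daemon(lines, host="localhost", port=26542, timeout=10.0):
    """Send VW-format examples over TCP and collect one reply line per example.

    Assumption: the server answers every newline-terminated example with
    exactly one newline-terminated line, in the same order.
    """
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with sock.makefile("r", encoding="utf-8", newline="\n") as reader:
            for line in lines:
                # Each example must end with a newline, as in a VW data file.
                sock.sendall((line.rstrip("\n") + "\n").encode("utf-8"))
                # Read the reply before sending the next example.
                replies.append(reader.readline().rstrip("\n"))
    return replies
```

Calling `query_daemon(["1 |name 1:1 2:4 4:1", "-1 |name 1:1 4:1"])` would send the two sample rows from the question. Sending one example and reading one reply at a time is the simplest pattern; batching many examples per write is also possible but needs more careful buffering.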

You can also intermix requests to learn from with requests that are intended only for prediction and should not update the in-memory model. The trick is to give a weight of zero to the examples you don't want the model to learn from.

This example will update the model because it has a weight of 1:

label 1 'tag1| input_features...

And this one won't update the model because it has a weight of 0:

label 0 'tag2| input_features...
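As a rough illustration of the two lines above, here is a small Python helper that builds an example string with an explicit weight and an optional tag. The helper name `format_example` and its arguments are hypothetical; the layout it emits (label, weight, quoted tag touching the bar, then namespace and features) follows the pattern shown in this answer.

```python
def format_example(label, features, weight=1, tag=None, namespace=""):
    """Build one VW-format line: label, weight, optional 'tag, then features.

    `features` maps feature names to values, e.g. {"1": 1, "2": 4}.
    A weight of 0 marks the example as predict-only.
    """
    head = f"{label} {weight} "
    if tag is not None:
        # The leading quote marks the token as a tag; it must touch the bar.
        head += f"'{tag}"
    feats = " ".join(f"{name}:{value}" for name, value in features.items())
    return f"{head}|{namespace} {feats}"


# One example to learn from and one to predict only (weight 0).
train_line = format_example(1, {"1": 1, "2": 4, "4": 1}, weight=1, tag="tag1", namespace="name")
query_line = format_example(1, {"1": 1, "4": 1}, weight=0, tag="tag2", namespace="name")
```

When no tag is given, the helper leaves a space before the bar so the weight is not mistaken for a tag.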

The Vowpal Wabbit wiki has more detail on the page "How to run vowpal wabbit as a daemon", although note that in its main example the model is pre-trained and loaded into memory.

Vowpal Wabbit supports reading data from standard input (cat train.dat | vw), so you can open a pipe directly from R.
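For readers working outside R, the same stdin approach could look roughly like this in Python. It is only a sketch: the helper name `pipe_to_command` is hypothetical, the default command reuses `vw -f my_model.vw` from the question's update, and it assumes `vw` is on the PATH.

```python
import subprocess


def pipe_to_command(lines, command=("vw", "-f", "my_model.vw")):
    """Stream in-memory lines to a command's standard input.

    Lines are written one at a time, so the data never touches the disk
    and is not copied into one large string. Returns the exit code.
    """
    proc = subprocess.Popen(list(command), stdin=subprocess.PIPE, text=True)
    for line in lines:
        proc.stdin.write(line.rstrip("\n") + "\n")
    # Closing stdin signals end of input so the command can finish.
    proc.stdin.close()
    return proc.wait()
```

A call such as `pipe_to_command(datarows)` would then play the role of the `pipe()` / `writeLines()` pair in the R snippet above.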

Daemon mode supports training. If you need incremental/continuous learning, you can use a trick: send a dummy example whose tag starts with the string "save", which tells VW to save the current model. Optionally you can specify the model filename as well:

1 save_filename| 

Yet another option is to use VW as a library; see an example.

Note that VW supports various kinds of feature engineering through feature namespaces.

I am also using R to transform data and output it to Vowpal Wabbit. There is an RVowpalWabbit package on CRAN which can be used to connect R with Vowpal Wabbit. However, it is only available on Linux.

Also, to speed things up, I use the fread function from the data.table package. Transformations on a data.table are also quicker than on a data.frame, but one needs to learn a different syntax.
