
Creating a Python algorithm to train a Keras model to predict a large sequence of integers

Original post: 2019-09-04 09:45:43 | Tags: python, tensorflow, machine-learning, keras

I'm new to machine learning, but I'm trying to apply it to a project of mine. I was able to train a model that converts words from one language to another using LSTM layers. Say I use A as input to my model and get B as output. What I do is:

'original word' -> word embedding -> one-hot encode (A) -> MODEL -> one-hot encoded output (B) -> word embedding -> 'translated word'

This is relatively simple because I'm using a character-level tokenizer to encode the words, which does not require much memory (small sequences, one for each word).

However, I now have to train a model that takes B as input and gives me C (no longer a translation problem). C will later be used for different purposes. The difference is that C can have a length of, say, 315 numbers, and each of them can be one of 5514 unique values, i.e., shape(215, 5514). In general, what I want to do is, for example:

'banana' -> (some processing, word embedding or one-hot) -> MODEL -> [434, 434, 410, 321, 225, 146, 86, 43, 13, -8, -23, -32, -38, -41, -13, 101, 227, 332, 411, 470, 515, 550, 577, 597, 611, 622, 628, 622, 608, 593, 580, 570, 561, 554, 549, 547, 548, 548, 549, 555, 564, 572, 579, 584, 587, 589, 590, 591, 591, 591, 590, 590, 584, 567, 550, 535, 524, 516, 511, 506, 503, 503, 507, 511, 518, 530, 543, 553, 561, 568, 573, 577, 580, 582, 584, 585, 586, 586, 587, 587, 588, 588, 588, 588, 588, 586]
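To put the target shape described above in perspective, here is a rough memory estimate. It is only a sketch: the sequence length (315) and the number of unique values (5514) are taken from the post, while the dataset size `N_EXAMPLES` and the choice of float32 and int16 storage are assumptions made for illustration.

```python
# Back-of-the-envelope memory estimate for the target sequences.
SEQ_LEN = 315        # sequence length quoted in the post
NUM_CLASSES = 5514   # number of unique values quoted in the post
N_EXAMPLES = 10_000  # hypothetical dataset size, not given in the post

def one_hot_bytes(n_examples, seq_len, num_classes, bytes_per_value=4):
    """Bytes for a dense float32 one-hot tensor of shape (n, seq_len, num_classes)."""
    return n_examples * seq_len * num_classes * bytes_per_value

def index_bytes(n_examples, seq_len, bytes_per_value=2):
    """Bytes when each timestep stores one int16 class index instead."""
    return n_examples * seq_len * bytes_per_value

dense = one_hot_bytes(N_EXAMPLES, SEQ_LEN, NUM_CLASSES)
compact = index_bytes(N_EXAMPLES, SEQ_LEN)
print(f"one-hot: {dense / 2**30:.1f} GiB, indices: {compact / 2**20:.1f} MiB")
```

Under these assumptions the one-hot tensor is several thousand times larger than the index array, because every timestep stores 5514 floats instead of a single integer.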

So the problem is that I don't have enough memory to one-hot encode the output sequences. I tried using generators to load each sequence from disk instead of holding all of them in memory, but it doesn't seem to be working.
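The post does not show the generator that failed, so the following is only a sketch of one common pattern, not a diagnosis: keep the targets as integer class indices and one-hot encode a single batch at a time. The function name `one_hot_batches`, the batch size, and the demo arrays are hypothetical. It assumes the targets are already class indices in the range 0 to `num_classes - 1`.

```python
import numpy as np

def one_hot_batches(inputs, targets, num_classes, batch_size=32):
    """Yield (x, y) batches forever; only one batch of targets is one-hot at a time.

    inputs:  array of shape (n, ...) with the already-encoded model inputs
    targets: int array of shape (n, seq_len) with class indices in [0, num_classes)
    """
    n = len(inputs)
    while True:  # Keras-style generators are expected to loop indefinitely
        order = np.random.permutation(n)  # reshuffle on every pass over the data
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            y_idx = targets[idx]
            y = np.zeros(y_idx.shape + (num_classes,), dtype=np.float32)
            # Write a single 1.0 per (example, timestep) position.
            np.put_along_axis(y, y_idx[..., None], 1.0, axis=-1)
            yield inputs[idx], y

# Tiny demo with made-up data: 3 examples, 2 timesteps, 4 classes.
x_demo = np.arange(6).reshape(3, 2)
t_demo = np.array([[0, 2], [1, 1], [3, 0]])
xb, yb = next(one_hot_batches(x_demo, t_demo, num_classes=4, batch_size=2))
print(xb.shape, yb.shape)
```

A generator like this can be passed to `fit_generator` in Keras versions from around the time of the post, or to `fit` in later TensorFlow releases, together with `steps_per_epoch`, since the generator never terminates on its own.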

Do you have any suggestions as to how I should approach this problem?

EDIT: The dataset I'm using has the following format: n lines, each containing 2 columns separated by a tab. The first column is the input word, and the second column is the sequence I want to obtain when the input is that word.
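A file in this format can be read one line at a time, so the whole dataset never has to sit in memory. The sketch below assumes the integers in the second column are separated by single spaces, which the post does not state. The function name `read_pairs` and the sample rows are made up for illustration.

```python
import io

def read_pairs(handle, sep=" "):
    """Yield (word, [int, ...]) pairs lazily from a tab-separated text stream."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, seq = line.split("\t", 1)  # first tab separates the two columns
        yield word, [int(tok) for tok in seq.split(sep) if tok]

# Stand-in for an open file; a real file object from open(...) works the same way.
sample = io.StringIO("banana\t434 434 410 321\napple\t-8 -23 13\n")
pairs = list(read_pairs(sample))
print(pairs)
```

If the real file uses commas or brackets in the second column, only the `sep` argument or the token parsing would need to change.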

1 answer

One-hot encoding adds one column for every unique category in the dataset. I think you should check how the model performs using only the tokenizer rather than both, because most of the time the tokenizer alone performs very well.
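One possible reading of this suggestion is to keep the targets as integer class ids and skip the one-hot step entirely. The sketch below shows only the encoding side; the helper names `build_vocab` and `encode`, the padding id 0, and the sample sequences are assumptions, not part of the answer. The target values are mapped to ids through a vocabulary because they can be negative, as in the 'banana' example.

```python
import numpy as np

def build_vocab(sequences):
    """Map every distinct target value to a class id; id 0 is reserved for padding."""
    values = sorted({v for seq in sequences for v in seq})
    return {v: i + 1 for i, v in enumerate(values)}

def encode(sequences, vocab, max_len):
    """Return an int32 array of shape (n, max_len), right-padded with 0."""
    out = np.zeros((len(sequences), max_len), dtype=np.int32)
    for row, seq in enumerate(sequences):
        ids = [vocab[v] for v in seq[:max_len]]  # truncate overly long sequences
        out[row, :len(ids)] = ids
    return out

seqs = [[434, 434, 410, -8], [13, -8]]
vocab = build_vocab(seqs)
y = encode(seqs, vocab, max_len=4)
print(vocab)
print(y)
```

In Keras, integer targets of this shape can be used directly with the `sparse_categorical_crossentropy` loss and a per-timestep softmax output, so the one-hot tensor is never materialized. Note that the padding id would be treated as an ordinary class unless it is masked or down-weighted.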