
Classification of variable sequence length data

I have data with a varying number of elements in each row:

sample feat1  feat2 feat3 feat4 feat5 feat6 feat7
 1       1      200  250    312   474  
 1       2      170  280    370
 ...
 1       12     220  400    470   520  620   720
 2       1      130  320    430   580  612   
 ...
 N       12     70   180    270   410

I found this sequence classification example:

import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.convolutional import Convolution1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
numpy.random.seed(7)
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=top_words)
# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
# create the model
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
model.add(MaxPooling1D(pool_length=2))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, nb_epoch=3, batch_size=64)

Can I use this as is, or modify it to fit my data? Some direction would be nice.

Also, if you have a better suggestion for which algorithm to use or how to approach this, please share it.

A general approach is to reserve a specific value that means "unknown". For example, if all your values are positive, you can use -1:

sample feat1  feat2 feat3 feat4 feat5 feat6 feat7
 1       1      200  250    312   474    -1    -1
 1       2      170  280    370    -1    -1    -1
 ...
 1       12     220  400    470   520   620   720
 2       1      130  320    430   580   612    -1  
 ...
 N       12     70   180    270   410    -1    -1

The network can then learn to ignore this value.

Keras even has a built-in function, pad_sequences (the one used in the question's code), that does this for you.
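To make the padding step concrete, here is a minimal pure-Python sketch that reproduces the padded table above for three of its rows. The helper name pad_rows and its parameters are made up for illustration and are not part of Keras. Note that pad_sequences pads with 0 at the front of each sequence by default, so to get the same layout from Keras you would likely pass padding='post' and value=-1.

```python
def pad_rows(rows, pad_value=-1, maxlen=None):
    """Right-pad each row with pad_value so that all rows share one length.

    Hypothetical helper for illustration; rows longer than maxlen are cut.
    """
    if maxlen is None:
        # Default to the length of the longest row.
        maxlen = max(len(row) for row in rows)
    padded = []
    for row in rows:
        kept = list(row[:maxlen])  # truncate if the row is too long
        padded.append(kept + [pad_value] * (maxlen - len(kept)))
    return padded


# Feature values (feat2 onward) taken from three rows of the example table.
rows = [
    [200, 250, 312, 474],
    [170, 280, 370],
    [220, 400, 470, 520, 620, 720],
]

padded = pad_rows(rows)
for row in padded:
    print(row)
```

The sample and feat1 columns are left out here because every row already has them. Whether -1 is a safe padding value depends on your data: it only works if -1 can never occur as a real feature value.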
