
Train a neural network with more than 2 input features

I have already read the Keras guide on using multiple inputs. However, I am still stumped, as I am new to RNNs and CNNs. I am working with Keras to train a neural network classifier. My CSV file has 3 features:

  • Sentence
  • Probability
  • Target

Each sentence has exactly 5 words, and there are 1860 such sentences. The probability is a float in the range [0, 1], and the target is the field that needs to be predicted (0 or 1).

I first tokenize the sentences and turn them into integer sequences, ready for a randomly initialized embedding, as shown below.

import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import numpy as np
import gensim
import pandas as pd
import os
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from gensim.models import Word2Vec, KeyedVectors
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, GRU
from keras.initializers import Constant
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from termcolor import colored
from keras.utils import to_categorical
import tensorflow as tf

import warnings
warnings.filterwarnings("ignore")

nltk.download('stopwords')
# one hot encode

seed = 42
np.random.seed(seed)
tf.set_random_seed(seed)


df = pd.read_csv('../../data/sentence_with_stv.csv')
sentence_lines = list()
lines = df['sentence'].values.tolist()
stv = df['stv'].values.tolist()

# Build the punctuation table and stop-word set once, not once per line
table = str.maketrans('', '', string.punctuation)
stop_words = set(stopwords.words('english'))

for line in lines:
    tokens = word_tokenize(line)
    tokens = [w.lower() for w in tokens]
    # Strip punctuation, then keep only alphabetic tokens that are not stop words
    stripped = [w.translate(table) for w in tokens]
    words = [word for word in stripped if word.isalpha()]
    words = [w for w in words if w not in stop_words]
    sentence_lines.append(words)

print('Number of lines:', len(sentence_lines))
EMBEDDING_DIM = 200

# Vectorize the text samples into lists of integer word indices
tokenizer_obj = Tokenizer()
tokenizer_obj.fit_on_texts(sentence_lines)
sequences = tokenizer_obj.texts_to_sequences(sentence_lines)

print(colored(sequences,'green'))

This gives me output such as:

Number of lines: 1860
[[2, 77, 20, 17, 81], 
 [12, 21, 17, 82], 
 [2, 83, 20, 17, 82], 
 [2, 20, 17, 43], 
 [12, 21, 17, 81], 
 ...

Now I need to append the probability to each of these lines, so that the new sequences resemble the following.

[[2, 77, 20, 17, 81, 0.456736827], 
 [12, 21, 17, 82, 0.765142873], 
 [2, 83, 20, 17, 82, 0.335627635], 
 [2, 20, 17, 43, 0.5453652], 
 [12, 21, 17, 81, 0.446739202],
 ...

I tried taking each row of the sequences and appending the probability like this:

sequence[x] = np.append(sequence[x], probability[x], axis=1)

where probability is an array of the same size (1860) containing only the probability values. After doing this for all the rows, I print a single row to check whether the value was appended. However, I get the output shown below.

[2.     77.     20.     17.     81.     0.456736827]

Any suggestions would be much appreciated.

The word indices should go into the neural net as indices, not as numeric values. Each index corresponds to a different word, and the indexing does not have the semantics of numbers. (10 is twice as much as 5, but nothing like this holds for categorical variables.) If you append a float to the indices in NumPy, the indices get converted into floats.
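The float conversion is easy to reproduce. Here is a minimal sketch using the first row and probability from the question; the variable names are only illustrative.

```python
import numpy as np

row = [2, 77, 20, 17, 81]   # word indices for one sentence
prob = 0.456736827          # probability for the same sentence

# A NumPy array holds a single dtype, so appending a float
# upcasts every integer index to float64.
combined = np.append(row, prob)

print(combined.dtype)   # the dtype of the whole array, indices included
print(combined)
```

This is why the printed row in the question shows 2., 77., 20. and so on: the indices are no longer integers and can no longer be used for an embedding lookup.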

The correct solution is to use an embedding layer for the word inputs. An embedding layer assigns a trainable vector to every item in your dictionary. Embeddings are typically followed by an RNN or a CNN to get a single vector, or you can simply concatenate the embeddings.
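Conceptually, an embedding layer is a lookup table with one row per vocabulary item. The NumPy sketch below only illustrates the lookup and two ways of reducing the word vectors to one sentence vector. The vocabulary size, the small embedding size, and the random table are assumptions for illustration; a real layer would learn the table during training, and mean pooling stands in for the RNN or CNN encoder.

```python
import numpy as np

rng = np.random.default_rng(42)

vocab_size = 100     # assumed: largest word index + 1
embedding_dim = 8    # assumed: kept small here (the question uses 200)

# One vector per word index; random here, trained in a real model.
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

sentence = np.array([2, 77, 20, 17, 81])   # word indices from the question

# The lookup is plain integer indexing: one row per word.
word_vectors = embedding_table[sentence]   # shape (5, embedding_dim)

# Option A: concatenate the word vectors into one long sentence vector.
concatenated = word_vectors.reshape(-1)    # shape (5 * embedding_dim,)

# Option B: reduce the sequence to a single vector. Mean pooling is NOT
# an RNN or CNN; it only stands in for such an encoder in this sketch.
pooled = word_vectors.mean(axis=0)         # shape (embedding_dim,)

print(word_vectors.shape, concatenated.shape, pooled.shape)
```

The lookup only works with integer indices, which is another reason the probability cannot live in the same array as the word indices.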

In any case, you cannot easily combine the inputs in NumPy; you need to do it in TensorFlow. First embed the words, and only once you have a tensor of continuous values can you append your numerical features.
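One way to follow this advice is a two-input model built with the Keras functional API. The sketch below is not taken from the question: the layer sizes, the vocabulary size, the LSTM encoder, and the target values are assumptions chosen for illustration. The word rows and probabilities are the first three from the question, padded with 0 to a common length.

```python
import numpy as np
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate
from tensorflow.keras.models import Model

MAX_LEN = 5           # sentences have at most 5 tokens
VOCAB_SIZE = 100      # assumed; in practice len(tokenizer_obj.word_index) + 1
EMBEDDING_DIM = 200   # as in the question

# Word indices stay integers; shorter rows are right-padded with 0.
word_ids = np.array([
    [2, 77, 20, 17, 81],
    [12, 21, 17, 82, 0],
    [2, 83, 20, 17, 82],
], dtype="int32")
# The probability is a separate float input, one value per sentence.
probs = np.array([[0.456736827], [0.765142873], [0.335627635]], dtype="float32")
# Made-up placeholder labels, only so the sketch can run.
targets = np.array([[1], [0], [1]], dtype="float32")

words_in = Input(shape=(MAX_LEN,), dtype="int32", name="words")
prob_in = Input(shape=(1,), name="probability")

# Embed the words, then encode the sequence into a single vector.
embedded = Embedding(VOCAB_SIZE, EMBEDDING_DIM, mask_zero=True)(words_in)
sentence_vec = LSTM(32)(embedded)

# Only now, with continuous values on both sides, append the probability.
merged = Concatenate()([sentence_vec, prob_in])
output = Dense(1, activation="sigmoid")(merged)

model = Model(inputs=[words_in, prob_in], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

model.fit([word_ids, probs], targets, epochs=1, verbose=0)
preds = model.predict([word_ids, probs], verbose=0)
print(preds.shape)
```

The two arrays are passed to fit and predict as a list in the same order as the model's inputs, so the integer indices and the float probability never share a NumPy array.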
