
How to train ML model on 2 columns to solve for classification?

I have three columns in a dataset on which I'm doing sentiment analysis (classes 0, 1, 2):

text    thing    sentiment

But the problem is that I can train my data only on either text or thing and get the predicted sentiment. Is there a way to train on both text and thing and then predict sentiment?

Problem case (say):

  |text  thing  sentiment
0 | t1   thing1    0
. |
. |
54| t1   thing2    2

This example tells us that the sentiment depends on thing as well. I could concatenate the two columns one below the other and then train, but that would be incorrect, as we wouldn't be giving the model any relationship between the two columns.

Also, my test set contains the two columns text and thing, for which I have to predict the sentiment according to the model trained on the two columns.

Right now I'm using the tokenizer and then the model below:

from keras.models import Sequential
from keras.layers import Embedding, SpatialDropout1D, LSTM, Dense

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Any pointers on how to proceed, or on which model or code changes to use?

You may want to switch to the Keras functional API and train a multi-input model.

According to the creator of Keras, François Chollet, in his book Deep Learning with Python [Manning, 2017] (chapter 7, section 1):

Some tasks require multimodal inputs: they merge data coming from different input sources, processing each type of data using different kinds of neural layers. Imagine a deep-learning model trying to predict the most likely market price of a second-hand piece of clothing, using the following inputs: user-provided metadata (such as the item's brand, age, and so on), a user-provided text description, and a picture of the item. If you had only the metadata available, you could one-hot encode it and use a densely connected network to predict the price. If you had only the text description available, you could use an RNN or a 1D convnet. If you had only the picture, you could use a 2D convnet. But how can you use all three at the same time? A naive approach would be to train three separate models and then do a weighted average of their predictions. But this may be suboptimal, because the information extracted by the models may be redundant. A better way is to jointly learn a more accurate model of the data by using a model that can see all available input modalities simultaneously: a model with three input branches.
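
As a rough illustration of the three-branch model described in this quote, a minimal sketch in the Keras functional API could look like the code below. All sizes (metadata features, vocabulary, sequence length, image dimensions) are made-up placeholders, not values from the book.

from keras.layers import (Input, Dense, Embedding, LSTM, Conv2D,
                          GlobalMaxPooling2D, Concatenate)
from keras.models import Model

# Made-up sizes, purely for illustration
NUM_METADATA_FEATURES = 20          # one-hot encoded metadata
VOCAB_SIZE, SEQ_LEN = 10000, 100    # text description
IMG_H, IMG_W = 64, 64               # picture

# Branch 1: densely connected network over the one-hot metadata
meta_input = Input(shape=(NUM_METADATA_FEATURES,))
meta_features = Dense(32, activation='relu')(meta_input)

# Branch 2: RNN over the text description
text_input = Input(shape=(SEQ_LEN,))
text_features = LSTM(32)(Embedding(VOCAB_SIZE, 64)(text_input))

# Branch 3: 2D convnet over the picture
image_input = Input(shape=(IMG_H, IMG_W, 3))
image_features = GlobalMaxPooling2D()(
    Conv2D(32, 3, activation='relu')(image_input))

# Merge the three branches and regress the price
merged = Concatenate()([meta_features, text_features, image_features])
price = Dense(1)(merged)

model = Model(inputs=[meta_input, text_input, image_input], outputs=price)
model.compile(optimizer='adam', loss='mse')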

I think the Concatenate functionality is the way to go in such a case, and the general idea should be as follows. Please tweak it according to your use case.

from keras.layers import Input, Concatenate, Dense
from keras.models import Model

### whatever preprocessing you may want to do
text_input = Input(shape=(1,))
thing_input = Input(shape=(1,))

### now bring them together
merged_inputs = Concatenate(axis=1)([text_input, thing_input])

### sample output layer (softmax for the 3 sentiment classes)
output = Dense(3, activation='softmax')(merged_inputs)

### pass your inputs and outputs to the model
model = Model(inputs=[text_input, thing_input], outputs=output)
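
Building on this, a minimal two-branch sketch for the question's text and thing columns might look like the following, assuming both columns have already been tokenized and padded into integer sequences. The vocabulary size, sequence lengths, and layer widths are placeholders to be tuned for the actual data.

from keras.layers import Input, Embedding, LSTM, Concatenate, Dense
from keras.models import Model

# Placeholder sizes; use your own tokenizer / padding settings
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 100
TEXT_MAXLEN = 100     # padded length of the text column
THING_MAXLEN = 10     # padded length of the thing column

# Branch for the text column
text_input = Input(shape=(TEXT_MAXLEN,), name='text')
text_features = LSTM(100, dropout=0.2, recurrent_dropout=0.2)(
    Embedding(MAX_NB_WORDS, EMBEDDING_DIM)(text_input))

# Branch for the thing column
thing_input = Input(shape=(THING_MAXLEN,), name='thing')
thing_features = LSTM(32, dropout=0.2, recurrent_dropout=0.2)(
    Embedding(MAX_NB_WORDS, EMBEDDING_DIM)(thing_input))

# Merge both branches and classify into the 3 sentiment classes
merged = Concatenate()([text_features, thing_features])
output = Dense(3, activation='softmax')(merged)

model = Model(inputs=[text_input, thing_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])

# Train and predict by passing the two padded arrays as a list, e.g.:
# model.fit([X_text, X_thing], y, epochs=5, batch_size=64)
# predictions = model.predict([X_text_test, X_thing_test])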

You have to take the multiple columns as lists and then merge them for training, after embedding and preprocessing the raw data. Example:

import pandas as pd

train = pd.read_csv('COVID19 multifeature Emotion - 50 data.csv', nrows=49)
# This dataset has two text columns and different class labels

X_train_doctor_opinion = train["doctor-opinion"].str.lower()
X_train_patient_opinion = train["patient-opinion"].str.lower()

X_train = list(X_train_doctor_opinion) + list(X_train_patient_opinion)

Then preprocess and embed.
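
As a rough sketch of that step, one way is to fit a single Tokenizer on the merged list and then turn each column back into padded integer sequences, which can then be embedded or fed to a multi-input model as shown above. The num_words and maxlen values below are placeholders.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_NB_WORDS = 20000        # placeholder vocabulary size
MAX_SEQUENCE_LENGTH = 100   # placeholder padded length

# Fit one tokenizer on the merged list of both text columns
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(X_train)

# Convert each column into padded integer sequences
seq_doctor = pad_sequences(tokenizer.texts_to_sequences(X_train_doctor_opinion),
                           maxlen=MAX_SEQUENCE_LENGTH)
seq_patient = pad_sequences(tokenizer.texts_to_sequences(X_train_patient_opinion),
                            maxlen=MAX_SEQUENCE_LENGTH)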
