
How to add new samples to the same label using Naive Bayes in php-ml?

I am new to text classification and am building some proofs of concept in PHP to better understand the ideas behind ML. I started from this example and tried to add a new short text to "reinforce" one of my labels (categories), in this case Japan:

<?php
include_once './vendor/autoload.php';
//source: https://www.softnix.co.th/2018/08/19/naive-bays-text-classification-with-php/
use Phpml\Classification\NaiveBayes;
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WhitespaceTokenizer;
use Phpml\Tokenization\WordTokenizer;
use Phpml\FeatureExtraction\TfIdfTransformer;

$arr_text = [
    "London bridge is falling down",
    "japan samurai Universal Studio spider man",
    "china beijing",
    "thai Chiangmai",
    "Universal Studio Hollywood",
    "2020 Olympic games"
];
$arr_label = [
    "London","Japan","China","Thailand","USA","Japan"
];

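// Split each text into word tokens and build a vocabulary from the training texts.
$tokenize = new WordTokenizer();
$vectorizer = new TokenCountVectorizer($tokenize);

$vectorizer->fit($arr_text);
$vocabulary = $vectorizer->getVocabulary();
// transform() works in place, so operate on a copy and keep $arr_text intact.
$arr_transform = $arr_text;
$vectorizer->transform($arr_transform);

// Fit TF-IDF weights on the count vectors, then reweight them in place.
$transformer = new TfIdfTransformer($arr_transform);
$transformer->transform($arr_transform);

// Train on the TF-IDF vectors; $arr_label[i] is the label of $arr_text[i].
$classifier = new NaiveBayes();
$classifier->train($arr_transform, $arr_label);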

$arr_testset = [
    'Hello Chiangmai I am Siam',
    'I want to go Universal Studio',
    'I want to go Universal Studio because I want to watch spider man',
    'Sonic in 2020'
];

// Reuse the fitted vectorizer and TF-IDF weights; both transform the test set in place.
$vectorizer->transform($arr_testset);
$transformer->transform($arr_testset);
$result = $classifier->predict($arr_testset);
var_dump($result);

The problem is that after adding Japan to the array of labels a second time, the result was:

array (size=4)
  0 => string 'Japan' (length=5)
  1 => string 'Japan' (length=5)
  2 => string 'Japan' (length=5)
  3 => string 'Japan' (length=5)

But I was expecting:

array (size=4)
  0 => string 'Thailand' (length=8)
  1 => string 'USA' (length=3)
  2 => string 'Japan' (length=5)
  3 => string 'Japan' (length=5)

So, how do I add new samples to the same label?

There are two problems with your training dataset:

  1. It is too small and not representative enough.
  2. The Japan label gets twice as much training data as any other label.

So the model for the Japan label is trained on two sentences whose words are unrelated and do not repeat, while every other label is trained on just one short sentence.
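The imbalance is easy to check before training. A minimal sketch, using only the label array from the question and PHP's built-in array_count_values() (nothing here depends on php-ml):

```php
<?php
// Labels copied from the question's $arr_label.
$arr_label = ["London", "Japan", "China", "Thailand", "USA", "Japan"];

// array_count_values() maps each distinct label to its number of occurrences.
$counts = array_count_values($arr_label);

foreach ($counts as $label => $n) {
    // Share of the whole training set carried by this label.
    $share = 100 * $n / count($arr_label);
    printf("%-8s %d sample(s), %.0f%% of training data\n", $label, $n, $share);
}
```

With only six samples, one extra sentence is enough to give a single label twice the weight of every other label.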

This leads to an underfitted model for the Japan label: it has "not learned enough" from the training data and can neither model the training data properly nor generalize to new data. In other words, it is too general and triggers on almost any sentence. The models for the remaining labels are overfitted: they fit the training data too closely and trigger only on sentences that are very close to the training set.

So the Japan label catches almost any sentence. And because it comes near the beginning of your label list, it catches every sentence before any label that follows it in the list has a chance to evaluate it. Of course, you could move the Japan labels to the end of the list, but the better solution is to enlarge your training set for all labels.

You can also observe the effect of the overfitted label models: for example, add the sentences "London bridge down" and "London down" to your test set. The first gives you London and the second gives you Japan, because the first sentence is close enough to the training sentence for the London label and the second is not.

So keep adding training data in exactly this manner; just make your training set big and representative enough.
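In other words, adding a sample to an existing label just means appending one more text and repeating the label at the same index. A minimal sketch, where addSample() is a hypothetical helper written for this illustration and not part of php-ml:

```php
<?php
/**
 * Hypothetical helper (not part of php-ml): appends one text and its label
 * so that the two arrays stay index-aligned.
 */
function addSample(array &$texts, array &$labels, string $text, string $label): void
{
    $texts[] = $text;
    $labels[] = $label;
}

// The first five samples from the question.
$arr_text = [
    "London bridge is falling down",
    "japan samurai Universal Studio spider man",
    "china beijing",
    "thai Chiangmai",
    "Universal Studio Hollywood",
];
$arr_label = ["London", "Japan", "China", "Thailand", "USA"];

// A second sample for the existing Japan label: the label is simply repeated.
addSample($arr_text, $arr_label, "2020 Olympic games", "Japan");
```

Because new texts change the vocabulary, the vectorizer, the TF-IDF transformer, and the classifier from the question would presumably need to be fit again on the full arrays after samples are added.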
