How to add new samples to the same label using Naive Bayes on php-ml?

I am new to text classification, and I am building some proofs of concept in PHP to better understand ML concepts. I started from this example and tried to add a new short text to "reinforce" one of my labels (categories), in this case Japan:

<?php
include_once './vendor/autoload.php';
//source: https://www.softnix.co.th/2018/08/19/naive-bays-text-classification-with-php/
use Phpml\Classification\NaiveBayes;
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WhitespaceTokenizer;
use Phpml\Tokenization\WordTokenizer;
use Phpml\FeatureExtraction\TfIdfTransformer;

$arr_text = [
    "London bridge is falling down",
    "japan samurai Universal Studio spider man",
    "china beijing",
    "thai Chiangmai",
    "Universal Studio Hollywood",
    "2020 Olympic games"
];
$arr_label = [
    "London","Japan","China","Thailand","USA","Japan"
];

$tokenize = new WordTokenizer();
$vectorizer = new TokenCountVectorizer($tokenize);

$vectorizer->fit($arr_text);
$vocabulary = $vectorizer->getVocabulary();
$arr_transform = $arr_text;
$vectorizer->transform($arr_transform);

$transformer = new TfIdfTransformer($arr_transform);
$transformer->transform($arr_transform);

$classifier = new NaiveBayes();
$classifier->train($arr_transform, $arr_label);

$arr_testset = [
    'Hello Chiangmai I am Siam',
    'I want to go Universal Studio',
    'I want to go Universal Studio because I want to watch spider man',
    'Sonic in 2020'
];

$vectorizer->transform($arr_testset);
$transformer->transform($arr_testset);
$result = $classifier->predict($arr_testset);
var_dump($result);

The problem is that, after adding Japan a second time to the array of labels, the result was:

array (size=4)
  0 => string 'Japan' (length=5)
  1 => string 'Japan' (length=5)
  2 => string 'Japan' (length=5)
  3 => string 'Japan' (length=5)

But I was expecting:

array (size=4)
  0 => string 'Thailand' (length=8)
  1 => string 'USA' (length=3)
  2 => string 'Japan' (length=5)
  3 => string 'Japan' (length=5)

So, how do I add new samples to the same label?

There are two problems with your training dataset:

  1. It is too small and not representative enough
  2. You gave the Japan label twice as much training data as any other label
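
A quick way to see the second point is to count the samples per label before training. The snippet below is a minimal sketch that uses only PHP's built-in array_count_values on the $arr_label array from the question; it does not touch php-ml at all.

```php
<?php
// Labels copied from the question's $arr_label.
$arr_label = ["London", "Japan", "China", "Thailand", "USA", "Japan"];

// Count how many training samples each label has.
$counts = array_count_values($arr_label);

// Print one "label: count" line per label to spot any imbalance.
foreach ($counts as $label => $count) {
    echo $label . ': ' . $count . PHP_EOL;
}
```

With the question's data, Japan is the only label with more than one sample.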

So the model for the Japan label is trained on two sentences whose words are completely unrelated and do not repeat. Every other label is trained on just one short sentence.

This leads to an underfitted model for the Japan label: it has "not learned enough" from the training data, so it can neither model the training data properly nor generalize to new data. In other words, it is too general and triggers on almost any sentence. The models for the remaining labels are overfitted: they model the training data too well and trigger only on sentences that are very close to the training data.

So the Japan label catches almost any sentence. And because it comes near the beginning of your list of labels, it catches all sentences before any label that comes after it in the list has a chance to evaluate them. Of course, you can move the Japan labels to the end of the list, but the better solution is to enlarge your training set for all labels.

You can also see the effect of an overfitted label model. For example, add the sentences "London bridge down" and "London down" to your test set: the first gives you London and the second gives you Japan, because the first sentence is close enough to the training sentence for the London label and the second is not.

So keep adding training data in exactly this manner; just make your training set big and representative enough.
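
As a rough sketch of "adding in this manner": each new sample goes into the text array and its label goes into the label array at the same index. The addSample helper below is hypothetical (it is not part of php-ml), and the starting arrays are shortened copies of the question's data.

```php
<?php
// Hypothetical helper, not a php-ml function: appends one sample and its
// label together so the two arrays stay index-aligned.
function addSample(array &$texts, array &$labels, string $text, string $label): void
{
    $texts[] = $text;
    $labels[] = $label;
}

// Shortened copies of the question's training arrays.
$arr_text = ["china beijing", "thai Chiangmai"];
$arr_label = ["China", "Thailand"];

// Add another sample for an existing or new label.
addSample($arr_text, $arr_label, "2020 Olympic games", "Japan");

// Sanity check before training: exactly one label per sample.
if (count($arr_text) !== count($arr_label)) {
    throw new LengthException('Samples and labels are out of sync.');
}
```

After adding samples, it is probably safest to rerun the whole pipeline from the question (fit the vectorizer, transform, then train) on the full arrays, since new sentences may introduce words that are not yet in the vocabulary.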
