
Python mlpy Classification of text

I'm new to the mlpy library and looking for the best way to implement sentence classification. I was thinking of using the mlpy Basic Perceptron, but from my understanding it uses a pre-defined vector size, whereas I need the vector size to grow dynamically while the machine is learning, because I don't want to create a huge vector of all English words. What I actually need to do is take a list of sentences, build a classifier from them, and then, when the application receives a new sentence, have it automatically classified into one of the labels (supervised learning).

Any ideas, thoughts, and examples would be very helpful.

Thanks

  1. If you have all the sentences beforehand, you can prepare a list of words (removing stop words) and map every word to a feature (see the Python sketch after this list). The size of the feature vector would then be the number of words in the dictionary.

  2. Once you have that, you can train a perceptron.
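
A minimal sketch of step 1 in Python might look like this (the sentences, labels, and stop-word list below are hypothetical placeholders; the resulting matrix X is what you would feed to the perceptron in step 2):

import numpy as np

# Hypothetical labelled sentences (+1 = spam, -1 = not spam).
sentences = ["buy cheap pills now", "meeting moved to monday",
             "cheap offer just for you", "lunch at noon tomorrow"]
labels = np.array([1, -1, 1, -1])

stop_words = {"to", "at", "for", "just", "now", "you"}  # toy stop-word list

# Map every remaining word to a feature index (the dictionary).
positions = {}
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in stop_words and word not in positions:
            positions[word] = len(positions)

# Turn each sentence into a fixed-size word-count vector.
def to_vector(sentence):
    vec = np.zeros(len(positions))
    for word in sentence.lower().split():
        if word in positions:
            vec[positions[word]] += 1
    return vec

X = np.vstack([to_vector(s) for s in sentences])
print(X.shape)  # (4, number of words kept in the dictionary)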

Have a look at my code, in which I do the mapping in Perl followed by a perceptron implementation in MATLAB, to understand how it works, and then write a similar implementation in Python.

Preparing the bag of words model (Perl)

use warnings;
use strict;

my %positions = ();
my $n = 0;
my $spam = -1;

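# First pass: read the training file, assign each new word a feature index
# in %positions, and write the examples in libsvm format (label index:count).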
open (INFILE, "q4train.dat");
open (OUTFILE, ">q4train_mod.dat");
while (<INFILE>) {
    chomp;
    my @values = split(' ', $_);
    my %frequencies = ();
    for (my $i = 0; $i < scalar(@values); $i = $i+2) {
        if ($i==0) {
            if ($values[1] eq 'spam') {
                $spam = 1;
            }
            else {
                $spam = -1;
            }
        }
        else {
            $frequencies{$values[$i]} = $values[$i+1];
            if (!exists ($positions{$values[$i]})) {
                $n++;
                $positions{$values[$i]} = $n;   
            }
        }
    }
    print OUTFILE $spam." ";
    my @keys = sort { $positions{$a} <=> $positions{$b} } keys %positions;
    foreach my $word (@keys) {
        if (exists ($frequencies{$word})) {
            print OUTFILE " ".$positions{$word}.":".$frequencies{$word};
        }
    }
    print OUTFILE "\n";
}
close (INFILE);
close (OUTFILE);

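# Second pass: convert the test file with the same word -> index map
# (words seen only in the test data are also added to %positions).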
open (INFILE, "q4test.dat");
open (OUTFILE, ">q4test_mod.dat");
while (<INFILE>) {
    chomp;
    my @values = split(' ', $_);
    my %frequencies = ();
    for (my $i = 0; $i < scalar(@values); $i = $i+2) {
        if ($i==0) {
            if ($values[1] eq 'spam') {
                $spam = 1;
            }
            else {
                $spam = -1;
            }
        }
        else {
            $frequencies{$values[$i]} = $values[$i+1];
            if (!exists ($positions{$values[$i]})) {
                $n++;
                $positions{$values[$i]} = $n;
            }
        }
    }
    print OUTFILE $spam." ";
    my @keys = sort { $positions{$a} <=> $positions{$b} } keys %positions;
    foreach my $word (@keys) {
        if (exists ($frequencies{$word})) {
            print OUTFILE " ".$positions{$word}.":".$frequencies{$word};
        }
    }
    print OUTFILE "\n";
}
close (INFILE);
close (OUTFILE);

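# Dump the word list so each feature index can be mapped back to its word.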
open (OUTFILE, ">wordlist.dat");
my @keys = sort { $positions{$a} <=> $positions{$b} } keys %positions;
foreach my $word (@keys) {
    print OUTFILE $word."\n";
}
close (OUTFILE);

Perceptron Implementation (MATLAB)

clc; clear; close all;

[Ytrain, Xtrain] = libsvmread('q4train_mod.dat');
[Ytest, Xtest] = libsvmread('q4test_mod.dat');

mtrain = size(Xtrain,1);
mtest = size(Xtest,1);
n = size(Xtrain,2);

% part a
% learn perceptron
Xtrain_perceptron = [ones(mtrain,1) Xtrain];
Xtest_perceptron = [ones(mtest,1) Xtest];
alpha = 0.1;
%initialize
theta_perceptron = zeros(n+1,1);
trainerror_mag = 100000;
iteration = 0;
%loop
while (trainerror_mag>1000)
    iteration = iteration+1;
    for i = 1 : mtrain
        Ypredict_temp = sign(theta_perceptron'*Xtrain_perceptron(i,:)');
        theta_perceptron = theta_perceptron + alpha*(Ytrain(i)-Ypredict_temp)*Xtrain_perceptron(i,:)';
    end
    Ytrainpredict_perceptron = sign(theta_perceptron'*Xtrain_perceptron')';
    trainerror_mag = (Ytrainpredict_perceptron - Ytrain)'*(Ytrainpredict_perceptron - Ytrain)
end
Ytestpredict_perceptron = sign(theta_perceptron'*Xtest_perceptron')';
testerror_mag = (Ytestpredict_perceptron - Ytest)'*(Ytestpredict_perceptron - Ytest)

I don't want to code the same thing in Python again, but this should give you a direction on how to proceed.
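
As a starting point, here is a rough NumPy sketch of the same perceptron loop (an illustration only: it assumes X and the +1/-1 labels were built by a bag-of-words mapping like the one sketched above, and once you have fixed-size vectors you could equally feed them to mlpy's perceptron):

import numpy as np

def train_perceptron(X, y, alpha=0.1, max_iters=100):
    # Plain perceptron with a bias term, mirroring the MATLAB loop above.
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend bias column
    theta = np.zeros(X.shape[1])
    for _ in range(max_iters):
        errors = 0
        for xi, yi in zip(X, y):
            pred = 1 if xi.dot(theta) >= 0 else -1
            if pred != yi:
                theta += alpha * (yi - pred) * xi
                errors += 1
        if errors == 0:  # every training example classified correctly
            break
    return theta

def predict(theta, X):
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.where(X.dot(theta) >= 0, 1, -1)

# Hypothetical usage with the X and labels from the earlier sketch:
# theta = train_perceptron(X, labels)
# print(predict(theta, X))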
