
Classifying texts with a perceptron

I have a folder that contains 7 subfolders, and each subfolder contains 8 files, so I have 56 files in total for the train set. For the test set, I have a folder that contains 7 subfolders, each with 2 files (14 files in total). I have another file which contains the 1000 most common words of the train set. For each text, I have to check whether each of these 1000 words occurs in it: if it does, the feature should be +1, otherwise -1, which gives a feature vector. Then I have to classify the texts with a bipolar perceptron (neural network). The threshold is 0.1 and the learning rate is 0.5. The part after assigning the weights doesn't work well. How can I change the code?

import os

common_words_file = "c:/python34/1000CommonWords.txt"
folder_path = "c:/python34/train"

def build_vectors(folder_path, common_words_file):
    # Load the 1000 most common words once.
    with open(common_words_file, encoding="utf-8") as f:
        common_words = f.read().split()

    classes = []  # one list of feature vectors per class folder
    for folder in sorted(os.listdir(folder_path)):
        class_dir = os.path.join(folder_path, folder)
        class_vectors = []
        for file_name in sorted(os.listdir(class_dir)):
            with open(os.path.join(class_dir, file_name), encoding="utf-8") as tex:
                words = set(tex.read().split())
            # +1 if the common word occurs in this text, -1 otherwise
            vec = [1 if word in words else -1 for word in common_words]
            class_vectors.append(vec)
        classes.append(class_vectors)
    return classes

def train(classes, threshold=0.1, rate=0.5, max_epochs=1000):
    n_classes = len(classes)
    n_features = len(classes[0][0])
    w = [[0.0] * n_features for _ in range(n_classes)]  # one weight row per output node
    b = [0.0] * n_classes                               # one bias per output node

    for epoch in range(max_epochs):
        changed = False
        for class_index, texts in enumerate(classes):
            for text in texts:
                for node in range(n_classes):
                    # net input: bias plus dot product of weights and input
                    y_in = b[node] + sum(text[i] * w[node][i] for i in range(n_features))
                    # bipolar activation function with threshold 0.1
                    if y_in > threshold:
                        y = 1
                    elif y_in < -threshold:
                        y = -1
                    else:
                        y = 0
                    # the node whose index matches the true class should output +1
                    target = 1 if node == class_index else -1
                    # perceptron rule: update weights and bias only on a mistake
                    if y != target:
                        for j in range(n_features):
                            w[node][j] += rate * text[j] * target
                        b[node] += rate * target
                        changed = True
        if not changed:  # a full epoch without updates: training has converged
            break
    return w, b

classes = build_vectors(folder_path, common_words_file)
w, b = train(classes)
print(w)
print(b)

The folder names (the language is Persian):

['اجتماعی', 'اديان', 'اقتصادی', 'سیاسی', 'فناوري', 'مسائل راهبردي ايران', 'ورزشی']

Each folder contains these files:

['13810320-txt-0132830_utf.txt', '13810821-txt-0172902_utf.txt', '13830627-txt-0431835_utf.txt', '13850502-txt-0751465_utf.txt', '13850506-txt-0754145_utf.txt', '13850723-txt-0802407_utf.txt', '13860630-txt-1002033_utf.txt', '13870730-txt-1219770_utf.txt']
['13860431-txt-0963964_utf.txt', '13860616-txt-0992811_utf.txt', '13860625-txt-0997674_utf.txt', '13860722-txt-1013944_utf.txt', '13860802-txt-1021550_utf.txt', '13870329-txt-1149735_utf.txt', '13870903-txt-1240455_utf.txt', '13871001-txt-1256894_utf.txt']
['13860321-txt-0940314_utf.txt', '13860930-txt-1055987_utf.txt', '13870504-txt-1169324_utf.txt', '13880223-txt-1337283_utf.txt', '13890626-txt-1614537_utf.txt', '13891005-txt-1681151_utf.txt', '13891025-txt-1694816_utf.txt', '13891224-txt-1732745_utf.txt']
['13821109-txt-0342352_utf.txt', '13840501-txt-0558076_utf.txt', '13840725-txt-0599073_utf.txt', '13850728-txt-0809843_utf.txt', '13850910-txt-0834263_utf.txt', '13871015-txt-1264594_utf.txt', '13880304-txt-1345179_utf.txt', '13890531-txt-1596470_utf.txt']
['13850816-txt-0807093_utf.txt', '13850903-txt-0830601_utf.txt', '13851012-txt-0853818_utf.txt', '13870605-txt-1185666_utf.txt', '13890301-txt-1542795_utf.txt', '13890626-txt-1614287_utf.txt', '13890716-txt-1628932_utf.txt', '13900115-txt-1740412_utf.txt']
['13870521-txt-1177039_utf.txt', '13870706-txt-1196885_utf.txt', '13870911-txt-1220118_utf.txt', '13871029-txt-1273519_utf.txt', '13880118-txt-1312303_utf.txt', '13880303-txt-1202027_utf.txt', '13880330-txt-1132374_utf.txt', '13880406-txt-1360964_utf.txt']
['13840803-txt-0602704_utf.txt', '13841026-txt-0651073_utf.txt', '13880123-txt-1315587_utf.txt', '13880205-txt-1324336_utf.txt', '13880319-txt-1353520_utf.txt', '13880621-txt-1401062_utf.txt', '13890318-txt-1553380_utf.txt', '13890909-txt-1665470_utf.txt']

Okay, here's the general rule for any classification task, in a nutshell: to classify anything (text, image, sound, ...) you first need to extract features from each data point (in your case, each text file). For your case, the features are the 1000 words you mentioned, so each feature vector for each training example has length 1000. You then feed those examples to your model of choice (any kind of neural network, or any other model) and get an output for each class. You also need a cost function, which measures how much the model's outputs deviate from the true label of each input example (each text file in your case), and you minimize that cost function with respect to the parameters of the model.
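For example, here is a minimal sketch that flattens such per-class feature vectors into a feature matrix X and a label vector y, the shapes most libraries expect. The name `classes` is an assumption: a list with one list of 1000-dimensional +1/-1 vectors per class folder, as in your code.

X = []  # all feature vectors, one row per text file
y = []  # integer class label for each row of X
for class_index, vectors in enumerate(classes):
    for vec in vectors:
        X.append(vec)
        y.append(class_index)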

There are quite a few publicly available implementations of the models you might want. Once you have constructed your feature vectors, you can use them directly:

Linear neural networks trained with the Perceptron learning rule: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html
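For instance, a minimal sketch of training it on the X and y arrays built above (those variable names are assumptions, not part of the library):

from sklearn.linear_model import Perceptron

clf = Perceptron()            # linear model trained with the perceptron rule
clf.fit(X, y)                 # learn one weight vector per class
predictions = clf.predict(X)  # predicted class labels, here on the training texts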

Neural networks using other activation functions and trained with gradient descent: http://scikit-learn.org/dev/modules/neural_networks_supervised.html
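For example, a minimal sketch with scikit-learn's MLPClassifier on the same X and y; the hyperparameters below are illustrative, not tuned:

from sklearn.neural_network import MLPClassifier

# one hidden layer of 100 units with a non-linear activation,
# trained by gradient-based optimization
clf = MLPClassifier(hidden_layer_sizes=(100,), activation='relu', max_iter=500)
clf.fit(X, y)
predictions = clf.predict(X)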

I suggest you use a neural network trained with gradient descent rather than a perceptron. Perceptrons can only learn linear decision boundaries: they assume your input data is linearly separable, as in the image below:

https://upload.wikimedia.org/wikipedia/commons/2/20/Svm_separating_hyperplanes.png

The points in that graph are data points. However, in real-world scenarios most datasets are not linearly separable. Just to give you an idea, sports documents may share a lot of words with social documents. So you are better off using a non-linear classifier such as a neural network.
