
Python Machine Learning Trained Classifier Error: index is out of bounds

I have a trained classifier that has been working fine.

I attempted to modify it to process multiple .csv files in a loop, but this broke it, to the point where the original code (which had been working fine) now returns the same error for .csv files it previously processed without any issues.

I am very confused and can't see what would have suddenly caused this error to appear when everything was working fine before. The original (working) code was:

    # -*- coding: utf-8 -*-

    import csv
    import pandas
    import numpy as np
    import sklearn.ensemble as ske
    import re
    import os
    import collections
    import pickle
    from sklearn.externals import joblib
    from sklearn import model_selection, tree, linear_model, svm


    # Load dataset
    url = 'test_6_During_100.csv'
    dataset = pandas.read_csv(url)
    dataset.set_index('Name', inplace = True)
    ##dataset = dataset[['ProcessorAffinity','ProductVersion','Handle','Company',
    ##            'UserProcessorTime','Path','Product','Description',]]

    # Open file to output everything to
    new_url = re.sub('\.csv$', '', url)
    f = open(new_url + " output report", 'w')
    f.write(new_url + " output report\n")
    f.write("\n")


    # shape
    print(dataset.shape)
    print("\n")
    f.write("Dataset shape " + str(dataset.shape) + "\n")
    f.write("\n")

    clf = joblib.load(os.path.join(
            os.path.dirname(os.path.realpath(__file__)),
            'classifier/classifier.pkl'))


    Class_0 = []
    Class_1 = []
    prob = []

    for index, row in dataset.iterrows():
        res = clf.predict([row])
        if res == 0:
            if index in Class_0:
                Class_0.append(index)
            elif index in Class_1:
                Class_1.append(index)           
            else:
                print "Is ", index, " recognised?"
                designation = raw_input()

                if designation == "No":
                    Class_0.append(index)
                else:
                    Class_1.append(index)

    dataset['Type']  = 1                    
    dataset.loc[dataset.index.str.contains('|'.join(Class_0)), 'Type'] = 0

    print "\n"

    results = []

    results.append(collections.OrderedDict.fromkeys(dataset.index[dataset['Type'] == 0]))
    print (results)

    X = dataset.drop(['Type'], axis=1).values
    Y = dataset['Type'].values


    clf.set_params(n_estimators = len(clf.estimators_) + 40, warm_start = True)
    clf.fit(X, Y)
    joblib.dump(clf, 'classifier/classifier.pkl')

    output = collections.Counter(Class_0)

    print "Class_0; \n"
    f.write ("Class_0; \n")

    for key, value in output.items():    
        f.write(str(key) + " ; " + str(value) + "\n")
        print(str(key) + " ; " + str(value))

    print "\n"
    f.write ("\n") 

    output_1 = collections.Counter(Class_1)

    print "Class_1; \n"
    f.write ("Class_1; \n")

    for key, value in output_1.items():    
        f.write(str(key) + " ; " + str(value) + "\n")
        print(str(key) + " ; " + str(value))

    print "\n" 

    f.close()

My new code is the same, but wrapped in a couple of nested loops so that the script keeps running while there are files to process in a folder. The new code (which caused the error) is below:

# -*- coding: utf-8 -*-

import csv
import pandas
import numpy as np
import sklearn.ensemble as ske
import re
import os
import time
import collections
import pickle
from sklearn.externals import joblib
from sklearn import model_selection, tree, linear_model, svm

# Our arrays which we'll store our process details in and then later print out data for
Class_0 = []
Class_1 = []
prob = []
results = []

# Open file to output our report to
timestr = time.strftime("%Y%m%d%H%M%S")

f = open(timestr + " output report.txt", 'w')
f.write(timestr + " output report\n")
f.write("\n")

count = len(os.listdir('.'))

while (count > 0):
    # Load dataset
    for filename in os.listdir('.'):
            if filename.endswith('.csv') and filename.startswith("processes_"):

                url = filename

                dataset = pandas.read_csv(url)
                dataset.set_index('Name', inplace = True)

                clf = joblib.load(os.path.join(
                        os.path.dirname(os.path.realpath(__file__)),
                        'classifier/classifier.pkl'))               

                for index, row in dataset.iterrows():
                    res = clf.predict([row])
                    if res == 0:
                        if index in Class_0:
                            Class_0.append(index)
                        elif index in Class_1:
                            Class_1.append(index)           
                        else:
                            print "Is ", index, " recognised?"
                            designation = raw_input()

                            if designation == "No":
                                Class_0.append(index)
                            else:
                                Class_1.append(index)

                dataset['Type']  = 1                    
                dataset.loc[dataset.index.str.contains('|'.join(Class_0)), 'Type'] = 0

                print "\n"

                results.append(collections.OrderedDict.fromkeys(dataset.index[dataset['Type'] == 0]))
                print (results)

                X = dataset.drop(['Type'], axis=1).values
                Y = dataset['Type'].values


                clf.set_params(n_estimators = len(clf.estimators_) + 40, warm_start = True)
                clf.fit(X, Y)
                joblib.dump(clf, 'classifier/classifier.pkl')

                os.remove(filename) 


output = collections.Counter(Class_0)

print "Class_0; \n"
f.write ("Class_0; \n")

for key, value in output.items():    
    f.write(str(key) + " ; " + str(value) + "\n")
    print(str(key) + " ; " + str(value))

print "\n"
f.write ("\n") 

output_1 = collections.Counter(Class_1)

print "Class_1; \n"
f.write ("Class_1; \n")

for key, value in output_1.items():    
    f.write(str(key) + " ; " + str(value) + "\n")
    print(str(key) + " ; " + str(value))

print "\n" 

f.close()

The error (IndexError: index 1 is out of bounds for size 1) points at the predict line, res = clf.predict([row]). As far as I understand it, the issue is that there are not enough "classes" or label types for the data (I'm going for a binary classifier)? But I had been using this exact method (outside the nested loops) without any issue before.
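One way to test the "not enough classes" suspicion is to look at which labels the loaded model actually knows about. The sketch below is illustrative only: describe_classes is a hypothetical helper, and it assumes the pickled object is a scikit-learn ensemble exposing classes_ and estimators_, with each tree exposing n_classes_, and that the labels are integers (0 and 1 here).

```python
def describe_classes(clf):
    """Return (labels known to the ensemble, distinct per-tree class counts)."""
    # Labels the ensemble as a whole will map predictions onto.
    forest_labels = [int(label) for label in clf.classes_]
    # Distinct numbers of classes the individual trees were trained with.
    tree_counts = sorted({int(tree.n_classes_) for tree in clf.estimators_})
    return forest_labels, tree_counts

# Hypothetical usage with the classifier from the question:
# clf = joblib.load('classifier/classifier.pkl')
# print(describe_classes(clf))
```

For a healthy binary classifier I would expect two labels and every tree reporting two classes; anything else would suggest the saved model was fitted on single-class data at some point.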

https://codeshare.io/Gkpb44 - Code share link that contains my .csv data for the above mentioned .csv file.

The problem is that [row] is an array of length 1. Your program tries to access index 1, which does not exist (indices start with 0). It looks like you may want to do res = clf.predict(row) or take another look at the row variable. Hope this helps.

So I have realised what the issue was.

I've set things up so that the classifier is loaded and then, using warm_start, re-fitted on the new data to update it, in an attempt to emulate incremental / online learning. This worked well while I was processing data that contained both classes. However, if the data contains only one class (only positives), re-fitting breaks the classifier. Because the re-fitted classifier is saved back to classifier/classifier.pkl, the original script, which loads the same file, then fails with the same error.

For now I have commented out the following:

clf.set_params(n_estimators = len(clf.estimators_) + 40, warm_start = True)
clf.fit(X, Y)
joblib.dump(clf, 'classifier/classifier.pkl')

which has solved the issue. Going forward I'll likely add in (yet another!) conditional statement to see if I should re-fit the data.
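A minimal sketch of what that conditional might look like, assuming the labels are 0 and 1 as in the script above. has_all_classes is a hypothetical helper name, not part of scikit-learn.

```python
def has_all_classes(labels, expected=(0, 1)):
    """Return True only if every expected class label occurs in `labels`."""
    return set(expected).issubset(set(labels))

# Hypothetical usage inside the per-file loop: skip the warm-start re-fit
# (and leave the saved classifier untouched) when a file has only one class.
# if has_all_classes(Y):
#     clf.set_params(n_estimators=len(clf.estimators_) + 40, warm_start=True)
#     clf.fit(X, Y)
#     joblib.dump(clf, 'classifier/classifier.pkl')
```

Files that fail the check would still be classified with the existing model; they just would not be used to update it.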

I was tempted to delete this question; however, as I hadn't found anything that covered this during my searching, I thought I would leave it up with the answer in case anyone else runs into the same issue.

