scipy.optimize + kmeans clustering

I have the following setup for kmeans clustering algorithm that I am implementing for a project:

import numpy as np 
import scipy
import sys
import random
import matplotlib.pyplot as plt
import operator
class KMeansClass:
    #takes in an npArray like object
    def __init__(self,dataset,k):
        self.dataset=np.array(dataset)
        #initialize mins to maximum possible value (sys.maxint was removed in Python 3; sys.maxsize works in both)
        self.min_x = sys.maxsize
        self.min_y = sys.maxsize
        #initialize maxs to minimum possible value
        self.max_x = -(sys.maxsize)-1
        self.max_y = -(sys.maxsize)-1
        self.k = k

        #a is the coefficient matrix that is continually updated as the centroids of the clusters change respectively.
        # It is an mxk matrix where each row corresponds to a training_instance and each column corresponds to a centroid of a cluster
        #Values are either 0 or 1. A value for a particular training_instance (data_point) is 1 only for that centroid to which the training_instance
        # has the least distance else the value is 0.
        self.a = np.zeros(shape=[self.dataset.shape[0],self.k])
        self.distanceMatrix = np.empty(shape =[self.dataset.shape[0],self.k])


        #allocate mu (k centroids, 2 coordinates each); np.empty leaves the values uninitialized until initializeCentroids() fills them in.
        self.mu = np.empty(shape=[k,2])


        self.findMinMaxdataPoints()
        self.initializeCentroids()
        self.createDistanceMatrix()
        self.scatterPlotOfInitializedPoints()


    #pointa and pointb are npArray like vectors.
    def euclideanDistance(self,pointa,pointb):
        return  np.sqrt(np.sum((pointa - pointb)**2))

    """ Problem Initialization And Visualization Helper methods"""
    ##############################################################################
    #@param: dataset : list of tuples [(x1,y1),(x2,y2),...(xm,ym)]
    def findMinMaxdataPoints(self):
        for item in self.dataset:
            self.min_x = min(self.min_x,item[0])
            self.min_y = min(self.min_y,item[1])
            self.max_x = max(self.max_x,item[0])
            self.max_y = max(self.max_y,item[1])



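    def initializeCentroids(self):
        for i in range(self.k):
            #each value of mu is a tuple with a random number in [min_x, max_x] and a random number in [min_y, max_y]
            self.mu[i] = (random.randint(self.min_x,self.max_x),random.randint(self.min_y,self.max_y))
        #sort once, after every centroid has been assigned; sorting inside the loop would reorder rows that are still uninitialized
        self.sortCentroids()

        print(self.mu)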

    def sortCentroids(self):

        #the following 3 lines of code are to ensure that the mu values are always sorted in ascending order first with respect to the
        #x values and then with respect to the y values.
        half_sorted = sorted(self.mu,key=operator.itemgetter(1))   #sort wrt y values
        full_sorted = sorted(half_sorted,key=operator.itemgetter(0)) #sort the y-sorted array wrt x-values
        self.mu = np.array(full_sorted)

    def scatterPlotOfInitializedPoints(self):
        plt.scatter([item[0] for item in self.dataset],[item[1] for item in self.dataset],color='b')
        plt.scatter([item[0] for item in self.mu],[item[1] for item in self.mu],color='r')
        plt.show()

    ###############################################################################

    #minimizing euclidean distance is the same as minimizing the square of the euclidean distance.
    def calcSquareEuclideanDistanceBetweenTwoPoints(self,point_a,point_b):
        return np.sum((point_a-point_b)**2)

    def createDistanceMatrix(self):
        for i in range(self.dataset.shape[0]):
            for j in range(self.k):
                self.distanceMatrix[i,j] = self.calcSquareEuclideanDistanceBetweenTwoPoints(self.dataset[i],self.mu[j])

    def createCoefficientMatrix(self):
        for i in range(self.dataset.shape[0]):
            self.a[i,self.distanceMatrix[i].argmin()] = 1

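    #update functions for CoefficientMatrix and Centroid values:
    def updateCoefficientMatrix(self):
        #clear the previous assignments first, otherwise a row can end up with more than one 1
        self.a.fill(0)
        for i in range(self.dataset.shape[0]):
            self.a[i,self.distanceMatrix[i].argmin()]= 1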

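    def updateCentroids(self):
        for j in range(self.k):
            non_zero_indices = np.nonzero(self.a[:,j])[0]
            #leave the centroid unchanged if no point is assigned to it (avoids dividing by zero)
            if len(non_zero_indices) > 0:
                #the new centroid is the mean of the data points assigned to cluster j, not of the entries of a
                self.mu[j] = self.dataset[non_zero_indices].mean(axis=0)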

    ############################################################

    def lossFunction(self):
        loss=0
        for j in range(self.k):
            #dot product of column j of a with column j of the distance matrix
            loss+=np.dot(self.a[:,j],self.distanceMatrix[:,j])
        return loss

My question is about lossFunction and how to use it with the scipy.optimize package. I would like to minimize the loss function iteratively by performing the following steps:

 Repeat until convergence:
      a> Optimize 'a' while keeping 'mu' constant (I have an updateCoefficientMatrix
         method for this; 'a' is an m x k matrix, where m is the number of training
         instances and k is the number of clusters.)
      b> Optimize 'mu' while keeping 'a' constant (I have an updateCentroids
         method for this; 'mu' is a k x 2 matrix holding one 2-D centroid per
         cluster.)

I am very new to the scipy.optimize package, so I am asking how to invoke scipy.optimize to achieve the optimization goal stated above.

Basically I have two matrices, a (m x k) and mu (k x 2), and I would like to minimize lossFunction() by first optimizing one of them while keeping the other constant, and in the next step optimizing the second while keeping the first constant. This can be considered a special case of the expectation-maximization problem, but I haven't quite understood what the documentation is trying to say so far, so I thought I'd turn to SO for help.

Thanks in advance!

And this is part of a class assignment so please do not post code! Any guidance or explanation would be highly appreciated.

Use scipy.optimize.minimize twice with different objective functions.

First run optimization with an objective function that takes a as a parameter, and returns the objective value.

As the second step, run scipy.optimize.minimize for a second time on a second objective function that takes mu as a parameter.

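When writing the objective functions, remember that Python supports nested functions (closures), so the objective can read mu (in the first case) or a (in the second case) from the enclosing scope instead of receiving it as an additional argument. Alternatively, pass it explicitly with minimize(..., args=(mu,)) and minimize(..., args=(a,)); note that args is documented as a tuple, so a list such as [mu] would be passed to the objective as one list argument.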
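To illustrate the two ways of fixing the other variable, here is a minimal sketch for the mu step only. The toy data X, the fixed assignment matrix a, and the helper names loss and make_objective are assumptions made up for this illustration, not part of the question's class. Since minimize works on a 1-D parameter vector, mu is flattened before the call and reshaped inside the objective.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data: 4 points in 2-D and a fixed 0/1 assignment matrix.
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
a = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
k, d = 2, 2  # number of clusters, dimension of each point


def loss(mu_flat, a, X):
    # minimize() passes a 1-D vector, so reshape it back to (k, d).
    mu = mu_flat.reshape(k, d)
    # (m, k) matrix of squared distances between points and centroids.
    D = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return np.sum(a * D)


def make_objective(a, X):
    # Nested function: a and X are captured from the enclosing scope.
    def objective(mu_flat):
        return loss(mu_flat, a, X)
    return objective


mu0 = np.zeros(k * d)

# Variant 1: closure, no extra arguments needed.
res_closure = minimize(make_objective(a, X), mu0)

# Variant 2: the fixed quantities are passed through args (a tuple).
res_args = minimize(loss, mu0, args=(a, X))

print(res_closure.x.reshape(k, d))
print(res_args.x.reshape(k, d))
```

Both calls minimize the same function, so they should agree up to the optimizer's tolerance; with a fixed, the minimizer of this loss is the mean of the points assigned to each cluster.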

Repeat the two-step process in a loop until your convergence condition is satisfied, for example until the loss stops decreasing by more than a small tolerance.
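The loop described above could look roughly like the following sketch. Several choices here are assumptions that the answer does not spell out: the a step is treated as a continuous problem with bounds 0 <= a <= 1 and the constraint that each row sums to 1 (without these the loss, which is linear in a, has no minimum), it is solved with method="SLSQP", and the result is snapped back to 0/1 afterwards. The function names, the toy data, and the tolerance are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def sq_dists(X, mu):
    # (m, k) matrix of squared Euclidean distances between points and centroids.
    return ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)


def loss(a, mu, X):
    # Same quantity as lossFunction(): sum over i, j of a[i, j] * ||x_i - mu_j||^2.
    return float(np.sum(a * sq_dists(X, mu)))


def optimize_a(mu, X):
    m, k = X.shape[0], mu.shape[0]
    D = sq_dists(X, mu)  # constant during this step because mu is fixed

    def objective(a_flat):
        return np.sum(a_flat.reshape(m, k) * D)

    # Assumed constraint: every point's coefficients sum to 1.
    rows_sum_to_one = {
        "type": "eq",
        "fun": lambda a_flat: a_flat.reshape(m, k).sum(axis=1) - 1.0,
    }
    a0 = np.full(m * k, 1.0 / k)  # feasible starting point
    res = minimize(objective, a0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * (m * k),
                   constraints=[rows_sum_to_one])
    # Snap the continuous solution to a hard 0/1 assignment.
    a = np.zeros((m, k))
    a[np.arange(m), res.x.reshape(m, k).argmax(axis=1)] = 1.0
    return a


def optimize_mu(a, X, mu_start):
    k, d = mu_start.shape

    def objective(mu_flat):
        return loss(a, mu_flat.reshape(k, d), X)

    res = minimize(objective, mu_start.ravel())
    return res.x.reshape(k, d)


def alternate(X, mu0, max_iter=20, tol=1e-8):
    mu = mu0.copy()
    prev = np.inf
    for _ in range(max_iter):
        a = optimize_a(mu, X)        # step a: mu fixed
        mu = optimize_mu(a, X, mu)   # step b: a fixed
        cur = loss(a, mu, X)
        if prev - cur < tol:         # stop once the loss no longer decreases
            break
        prev = cur
    return a, mu, cur


# Hypothetical toy data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
mu0 = np.array([[1.0, 1.0], [9.0, 9.0]])
a, mu, final_loss = alternate(X, mu0)
print(a)
print(mu)
print(final_loss)
```

The number of unknowns in the a step grows as m * k, so a general-purpose solver becomes slow for larger datasets; this sketch is only meant to show the alternating structure on a handful of points.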
