How to access cluster labels from a fit method in AWS Sagemaker

Background information:

AWS SageMaker lets you use external sklearn clustering methods, like DBSCAN, as well as built-in clustering methods, like KMeans, for fitting and deploying/predicting. By default, you get access to the cluster labels after deploying your model as a predictor object:

Example:

from sagemaker import KMeans

# role, output_path_cluster and sagemaker_session are defined earlier
kmeans_customers_3 = KMeans(role=role,
                            instance_count=1,
                            instance_type='ml.c4.xlarge',
                            output_path=output_path_cluster,
                            k=3,
                            epochs=20,
                            sagemaker_session=sagemaker_session)

# train the model
kmeans_customers_3.fit(some_data)

# deploy the trained model as an endpoint
kmeans_predict_3 = kmeans_customers_3.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium"
)

cluster_info = kmeans_predict_3.predict(aws_conform_data_in_record_set)

# one cluster label per data point
cluster_labels = [cluster.label['closest_cluster'].float32_tensor.values[0] for cluster in cluster_info]

Problem:

Most external clustering methods from sklearn have no predict() function. For example, AgglomerativeClustering and DBSCAN only have fit() and fit_predict() methods, which is not compatible with AWS deployment. Only methods that have a predict() method, like KMeans or AffinityPropagation, work well with AWS ( https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html )

Question:

How can I access a fitted clustering model from AWS, so that I have access to the model.class_labels attributes after fit (in the hope of not being limited to clustering methods that have a predict method)? I know how to download the model.tar.gz, but I'm a bit confused about what to do with it, since opening it does not help.

It might also be possible to write a custom predict function for such a method that only returns the class labels. However, I don't know how to do that in this environment, since AWS uses an SKLearn object, and I don't believe I can override it or the methods of e.g. DBSCAN itself.

Any ideas on how to retrieve the class labels of clustering methods from a fit method in AWS SageMaker?

Once your sklearn model is trained and saved in S3 as a model.tar.gz, you can download it to the client of your choice, untar it, and re-open it with the same library you used to save it (pickle, joblib, etc.).
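As a rough illustration of the untar-and-reload step, here is a minimal sketch. It assumes the archive has already been downloaded locally and that the training script saved the estimator with pickle under the name "model.pkl"; both the helper name load_model_from_archive and that file name are hypothetical, so adjust them to whatever your training script actually wrote (and use joblib.load instead of pickle.load if the model was saved with joblib).

```python
import pickle
import tarfile


def load_model_from_archive(archive_path, member_name="model.pkl"):
    """Return the object pickled inside a model.tar.gz archive.

    member_name is an assumption: it must match the file name that the
    training script used when it saved the model.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        # Read the member directly instead of extracting the archive to disk.
        member = archive.extractfile(member_name)
        if member is None:
            raise FileNotFoundError(
                f"{member_name} is not a regular file in {archive_path}"
            )
        with member:
            # Only unpickle archives that you produced yourself.
            return pickle.load(member)
```

For a fitted sklearn DBSCAN or AgglomerativeClustering estimator, the labels of the training data should then be available as the labels_ attribute of the returned object, for example load_model_from_archive("model.tar.gz").labels_.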

If you're looking for a way to open the model.tar.gz produced by the built-in SageMaker KMeans algorithm, see the SageMaker example "Analyze US census data for population segmentation", in particular the section "Accessing the KMeans model attributes", which contains this code sample:

import mxnet as mx
Kmeans_model_params = mx.ndarray.load("model_algo-1")

The code sample you provided in your question is correct if you want to calculate (predict) the labels for all data points in your dataset.

Another example, Bring Your Own Model (k-means), shows how to package your own KMeans model, e.g. one trained with sklearn.cluster.KMeans, for inference inside the SageMaker built-in KMeans container. This is the main part of the code:

import mxnet as mx
# kmeans is a fitted sklearn.cluster.KMeans instance
centroids = mx.ndarray.array(kmeans.cluster_centers_)
mx.ndarray.save("model_algo-1", [centroids])

If you're looking for a way to host another scikit-learn model in SageMaker, you need to create an inference.py script that defines model_fn() and predict_fn(), as in the SageMaker scikit-learn Bring Your Own Model example.
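A minimal sketch of what such an inference.py could look like for an estimator without predict() is shown below. This is not taken from the SageMaker example. It assumes the training script saved the estimator with pickle as "model.pkl" (a hypothetical file name; swap in joblib.load if you saved with joblib), and it falls back to fit_predict() when the model has no predict() method.

```python
import os
import pickle


def model_fn(model_dir):
    """Load the estimator that the training script saved into model_dir."""
    # Assumption: the model was written with pickle under the name "model.pkl".
    with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
        return pickle.load(f)


def predict_fn(input_data, model):
    """Return one cluster label per row of input_data."""
    if hasattr(model, "predict"):
        # Estimators such as KMeans can label new points directly.
        return model.predict(input_data)
    # Estimators such as DBSCAN or AgglomerativeClustering only offer
    # fit_predict(), which re-clusters the rows sent in this request.
    return model.fit_predict(input_data)
```

Note the design trade-off in the fallback branch: fit_predict() clusters only the rows in the current request, so the returned labels describe that batch and are not comparable across requests. If you only need the labels of the training data, reading the labels_ attribute of the downloaded model is likely simpler than deploying an endpoint.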
