Custom function in sklearn2pmml PMMLPipeline

I am trying to create a machine learning model to suggest treatment for stroke patients based on their responses to various questionnaires and assessments. For instance, the patient will be asked to rate the stiffness of the fingers, elbow, shoulder, and pectoral muscles (each on a scale of 0 to 100) or answer 14 questions related to mental health (each on a scale of 0 to 3).

I would like to create an sklearn pipeline roughly as follows:

1. The patient responses are aggregated. For example, the four stiffness responses should be averaged to create a single “stiffness” value, while the fourteen mental health questions should be summed up to create a single “mental health” value. The “stiffness” and “mental health” values would then be features in the model.

2. Once the features have been aggregated in this way, a decision tree classifier is trained on labeled data to assign each patient to the appropriate therapy.

3. The trained pipeline is exported as a PMML file for production.

I assume this must be doable with some code like this:

from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml import sklearn2pmml
from sklearn.tree import DecisionTreeClassifier
from somewhere import Something

pipeline = PMMLPipeline([
    ("input_aggregation", Something()),
    ("classifier", DecisionTreeClassifier())
])

pipeline.fit(patient_input, therapy_labels)

sklearn2pmml(pipeline, "ClassificationPipeline.pmml", with_repr = True)

I've been poking around the documentation, and I can figure out how to apply PCA to a group of columns, but not how to do something as straightforward as collapsing a group of columns by summing or averaging. Does anyone have any hints about how I could do this?

Thanks for your help.

You just need to define a custom function and use it in the Pipeline.

Here is the full code:

import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn2pmml import make_pmml_pipeline

# fake data with 7 columns
X = np.random.rand(10, 7)

def custom_function(X):
    # for each row: average the first 4 columns and sum the remaining ones
    # keepdims=True keeps each result as a column, so no hard-coded row count is needed
    mean_part = np.mean(X[:, :4], axis=1, keepdims=True)
    sum_part = np.sum(X[:, 4:], axis=1, keepdims=True)
    return np.concatenate([mean_part, sum_part], axis=1)

# Now, if you run `custom_function(X)`, it should return an array of shape (10, 2).

pipeline = make_pmml_pipeline(
    FunctionTransformer(custom_function),
)
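
To see the aggregation step feed a classifier in plain Python, here is a minimal sketch that uses a regular scikit-learn Pipeline rather than a PMML pipeline. The function name aggregate_columns, the column layout (first four columns averaged, the rest summed), and the random labels are assumptions made up for illustration. It only shows that the approach runs in Python and says nothing about whether the result can be exported to PMML.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeClassifier


def aggregate_columns(X):
    # Assumed layout: columns 0-3 are stiffness scores, the rest are mental health answers.
    stiffness = X[:, :4].mean(axis=1, keepdims=True)
    mental_health = X[:, 4:].sum(axis=1, keepdims=True)
    return np.hstack([stiffness, mental_health])


rng = np.random.default_rng(0)
X_train = rng.random((10, 7))      # fake data with 7 columns
y_train = np.array([0, 1] * 5)     # made-up therapy labels

pipeline = Pipeline([
    ("input_aggregation", FunctionTransformer(aggregate_columns)),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])
pipeline.fit(X_train, y_train)

# The function does not depend on the number of rows, so new data of any length works.
X_new = rng.random((3, 7))
features = pipeline.named_steps["input_aggregation"].transform(X_new)
predictions = pipeline.predict(X_new)
```

Here features has one row per patient and two columns (the averaged and the summed group), which is what the decision tree is trained on.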

Sample code:

from sklearn.tree import DecisionTreeClassifier
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing import Aggregator

pipeline = PMMLPipeline([
  ("mapper", DataFrameMapper([
    (["stiffness_1", "stiffness_2", "stiffness_3", "stiffness_4"], Aggregator(function = "mean")),
    (["mental_health_1", "mental_health_2", .., "mental_health_14"], Aggregator(function = "sum"))
  ])),
  ("classifier", DecisionTreeClassifier())
])
pipeline.fit(X, y)

Explanation: you can use sklearn_pandas.DataFrameMapper to define a column group and apply a transformation to it. For the conversion to PMML to work, you need to provide a transformer class, not a plain Python function. Perhaps all your transformation needs are handled by the sklearn2pmml.preprocessing.Aggregator transformer class. If not, you can always define your own.

While @makis has provided a 100% valid Python example, it wouldn't work in the Python-to-PMML case, because the converter cannot parse/handle custom Python functions.
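
One way to sanity-check the features that the two Aggregator steps are meant to produce is to compute the same row-wise mean and sum directly in pandas and compare. This is a rough sketch, not part of the pipeline itself: the fake data is generated at random, and the column names mental_health_3 through mental_health_13 are assumed to follow the naming pattern used above.

```python
import numpy as np
import pandas as pd

# Column groups; the full mental health list is an assumed naming pattern.
stiffness_cols = [f"stiffness_{i}" for i in range(1, 5)]
mental_health_cols = [f"mental_health_{i}" for i in range(1, 15)]

# Fake responses: stiffness on a 0-100 scale, mental health answers on a 0-3 scale.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    np.hstack([
        rng.integers(0, 101, size=(5, 4)),
        rng.integers(0, 4, size=(5, 14)),
    ]),
    columns=stiffness_cols + mental_health_cols,
)

# Reference features: one averaged value and one summed value per patient.
expected = pd.DataFrame({
    "stiffness": X[stiffness_cols].mean(axis=1),
    "mental_health": X[mental_health_cols].sum(axis=1),
})
print(expected)
```

The expected frame has one row per patient and can be compared against the output of the mapper step once the pipeline has been fitted.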
