
Time series classification with noisy labels using cleanlab and sktime

I want to improve my sktime classifier using cleanlab. Here is some sample data:

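# Imports assumed; the original post does not show where randint comes from
from random import randint
import numpy as np
import pandas as pd

x = np.linspace(0, 3, 500)
# 100 sine series (label True) and 100 tangent series (label False), each with a random amplitude
X_true = np.array([randint(1, 10) * np.sin(x) for _ in range(100)])
X_false = np.array([randint(1, 10) * np.tan(x) for _ in range(100)])
y = [True] * 100 + [False] * 100
df = pd.concat([pd.DataFrame(X_true), pd.DataFrame(X_false)])
df['y'] = y
# Shuffle the rows so that the two classes are mixed
df = df.sample(frac=1).reset_index(drop=True)
X = df.drop('y', axis=1).to_numpy()
y = df['y'].to_numpy()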

This creates a dataset of time series: sine curves with the label True and tangent curves with the label False. To introduce some label errors, we set the first 20 targets to True:

y[:20]=True
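Note that because the rows are shuffled, only those of the first 20 labels that were originally False actually get corrupted. A minimal sketch for counting the real label errors, assuming a seeded NumPy generator in place of the DataFrame shuffle (the names y_clean, y_noisy and flipped are made up for this illustration):

```python
import numpy as np

# Assumption: a seeded generator stands in for df.sample(frac=1) above.
rng = np.random.default_rng(0)

# 100 True labels (sine) and 100 False labels (tangent), shuffled.
y_clean = np.array([True] * 100 + [False] * 100)
rng.shuffle(y_clean)

# Inject the noise exactly as in the question.
y_noisy = y_clean.copy()
y_noisy[:20] = True

# Positions where the noisy label differs from the clean one.
flipped = np.flatnonzero(y_clean != y_noisy)
noise_rate = flipped.size / y_clean.size

print(f"{flipped.size} labels flipped, overall noise rate {noise_rate:.3f}")
```

Keeping a copy of the clean labels like this also makes it possible to check later which of the injected errors a noise-aware classifier recovers.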

Now I use the sktime classifier to predict the label of each time series, which works fine:

>>> X=from_2d_array_to_nested(X)
>>> clf=TimeSeriesForestClassifier(n_jobs=-1).fit(X,y)
>>> clf.score(X,y)
0.95

However, I want to use cleanlab to inform the classifier that some of its training labels might not be correct:

>>> LearningWithNoisyLabels(clf=TimeSeriesForestClassifier()).fit(X,y)

But this results in a KeyError

KeyError: "None of [Int64Index([  1,   2,   4,   5,   6,   7,  11,  13,  15,  17,\n            ...\n            186, 187, 188, 190, 191, 192, 194, 196, 198, 199],\n           dtype='int64', length=160)] are in the [columns]"

Since LearningWithNoisyLabels works for me with other classifiers, I guess the problem lies with the sktime classifier, but I am not sure.

Version Info:

>>> cleanlab.__version__, sktime.__version__
('0.1.1', '0.5.3')

Imports:

>>> from cleanlab.classification import LearningWithNoisyLabels
>>> from sktime.utils.data_processing import from_2d_array_to_nested
>>> from sktime.classification.all import TimeSeriesForestClassifier

The problem is that during LearningWithNoisyLabels(..).fit(), the function cleanlab.latent_estimation.estimate_confident_joint_and_cv_pred_proba throws an exception because it does not handle the sktime feature format correctly. The result of from_2d_array_to_nested() is a pd.DataFrame with one column and a pd.Series in each cell.
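The KeyError message ("None of [...] are in the [columns]") is consistent with the input being indexed as X[idx] with an integer array of row positions during cross-validation. That is an assumption about cleanlab's internals, not something verified here. The sketch below builds a small nested frame by hand with pandas only (no sktime needed; X_2d, nested and train_idx are illustrative names) and shows how the same indexing expression behaves on both formats:

```python
import numpy as np
import pandas as pd

# Five toy "time series" with four time points each.
X_2d = np.arange(20, dtype=float).reshape(5, 4)

# Hand-built stand-in for the nested format: one column, one pd.Series per cell.
cells = np.empty(len(X_2d), dtype=object)
for i, row in enumerate(X_2d):
    cells[i] = pd.Series(row)
nested = pd.DataFrame({0: cells})

# Row positions of a hypothetical training fold.
train_idx = np.array([1, 2, 4])

# On a NumPy array, integer-array indexing selects rows.
rows = X_2d[train_idx]

# On a DataFrame, the same expression looks up *columns* named 1, 2 and 4.
try:
    nested[train_idx]
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)

# Positional row selection on a DataFrame needs .iloc instead.
rows_iloc = nested.iloc[train_idx]
```

So any code that slices its input with X[idx] works on a plain array but fails on the nested DataFrame, which is why the conversion has to happen after the slicing.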

However, if we put the TimeSeriesForestClassifier inside a pipeline that takes a normal np.array as input and does the conversion itself, everything works fine:

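from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# The conversion to the nested format now happens inside the pipeline
clf = make_pipeline(FunctionTransformer(from_2d_array_to_nested),
                    TimeSeriesForestClassifier())
clf_clean = LearningWithNoisyLabels(clf)
# X must be the plain 2D NumPy array here, not the already nested DataFrame
clf_clean.fit(X, y)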
