How to use time-series data in classification in sklearn

I have a time-series dataset in which I record two time series for each of my sensors. The Label column indicates whether the sensor is faulty (1) or not (0).

sensor, time-series 1, time-series 2, Label
x1, [38, 38, 35, 33, 32], [18, 18, 12, 11, 9], 1
x2, [33, 32, 35, 36, 32], [13, 12, 15, 16, 12], 0
and so on ..

Currently, I compute various features from the two time series (e.g., min, max, median, slope, etc.) and use them for classification with a random forest classifier in sklearn, as follows.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

df = pd.read_csv(input_file)
X = df[myfeatures]  # myfeatures is my list of feature column names
y = df['Label']

# Random Forest classifier
clf = RandomForestClassifier(random_state=42, class_weight='balanced',
                             criterion='gini', max_depth=3,
                             max_features='auto', n_estimators=500)

k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

output = cross_validate(clf, X, y, cv=k_fold, scoring='roc_auc', return_estimator=True)
for idx, estimator in enumerate(output['estimator']):
    print("Features sorted by their score for estimator {}:".format(idx))
    feature_temp_importances = pd.DataFrame(
        estimator.feature_importances_,
        index=myfeatures,
        columns=['importance']
    ).sort_values('importance', ascending=False)
    print(feature_temp_importances)
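To be concrete, the kind of feature extraction I mean could be sketched like this (a minimal example with NumPy; `extract_features` is just an illustrative helper, and the slope is taken as the coefficient of a least-squares line fit):

```python
import numpy as np

# Two example series for one sensor, as in the data above.
ts1 = np.array([38, 38, 35, 33, 32])
ts2 = np.array([18, 18, 12, 11, 9])

def extract_features(ts):
    # Slope of a degree-1 least-squares fit over the time index.
    slope = np.polyfit(np.arange(len(ts)), ts, 1)[0]
    return [ts.min(), ts.max(), np.median(ts), slope]

# One feature row per sensor: 4 features per series, 8 in total.
features = extract_features(ts1) + extract_features(ts2)
```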

However, my results are very low. I am wondering whether it is possible to give the time-series data as-is to the random forest classifier, for example, giving x1's features as [38, 38, 35, 33, 32], [18, 18, 12, 11, 9]. If it is possible, I would like to know how I can do it in sklearn.

I am happy to provide more details if needed.

If you want to feed the whole time series to the model and use it to make predictions, you should try RNNs.

Another option, if you want to continue with sklearn, is to apply a rolling mean or rolling std to your time series, so that x at time t is influenced by x at time t - 1 and so on. With this correlation you can classify each point into a specific class, and then classify the whole time series by the majority label of its points.
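For example, a rolling transform in pandas might look like this (a minimal sketch using one of the question's series; the window size of 2 is just an illustration):

```python
import pandas as pd

# One sensor's series from the question.
ts = pd.Series([38, 38, 35, 33, 32])

# Rolling statistics over a window of 2: each value now reflects its predecessor.
rolling_mean = ts.rolling(window=2).mean()
rolling_std = ts.rolling(window=2).std()

print(rolling_mean.tolist())  # [nan, 38.0, 36.5, 34.0, 32.5]
```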

Yes, you can use the entire time-series data as the features for your classifier.

To do that, just use the raw data: concatenate the two time series for each sensor and feed the result to the classifier.

from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.ensemble import RandomForestClassifier
import numpy as np

n_samples = 100

# Generate n_samples random time series of length 5 with integer values from 0 to 99.
x1 = np.random.randint(0, 100, (n_samples, 5))
x2 = np.random.randint(0, 100, (n_samples, 5))

# One row per sensor: the two series concatenated side by side.
X = np.hstack((x1, x2))


# generates n_samples random binary labels.
y = np.random.randint(0, 2, n_samples)

# Random Forest classifier
clf = RandomForestClassifier(random_state=42, class_weight='balanced',
                             criterion='gini', max_depth=3,
                             max_features='auto', n_estimators=500)

k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

output = cross_validate(clf, X, y, cv=k_fold, scoring='roc_auc', return_estimator=True)
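To build that same X from your own file, assuming the CSV stores each series as a bracketed list string (as in the sample rows of the question), you could parse and stack the columns like this (the column names `ts1`/`ts2` are just placeholders for yours):

```python
import numpy as np
import pandas as pd
from ast import literal_eval
from io import StringIO

# Stand-in for pd.read_csv(input_file), using the question's sample rows.
csv = StringIO(
    'sensor,ts1,ts2,Label\n'
    'x1,"[38, 38, 35, 33, 32]","[18, 18, 12, 11, 9]",1\n'
    'x2,"[33, 32, 35, 36, 32]","[13, 12, 15, 16, 12]",0\n'
)
df = pd.read_csv(csv)

# Parse the list strings, then put one sensor per row with both series concatenated.
ts1 = np.vstack(df['ts1'].apply(literal_eval).tolist())
ts2 = np.vstack(df['ts2'].apply(literal_eval).tolist())
X = np.hstack((ts1, ts2))  # shape (n_sensors, 10)
y = df['Label'].values
```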

However, you might not want to use a random forest with these features. Have a look at LSTMs or even 1-D CNNs; they may be better suited to this approach of using the entire time series as input.
