What pre-processing methods do I need for Timestamp, Duration data for use with DBSCAN?

I have a month's worth of data that is in the form of:


            timestamp  duration
0 2015-10-01 00:00:08    2912.0
1 2015-10-01 00:48:58      30.0
2 2015-10-01 00:49:58     229.0
3 2015-10-01 00:54:07    4122.0
4 2015-10-01 02:03:19       0.0
...

I want to cluster this data on two dimensions, time of day (HH:MM:SS) and duration, using DBSCAN from the scikit-learn library.

I understand that a preprocessing step is needed before clustering, but I do not know which one to use.

I would appreciate it if someone could point me in the right direction.

Thank you!

Here is a tentative answer: I am in a similar situation with a classification problem. Classification algorithms are not so different from clustering ones in terms of results, since in both cases the goal is to group observations that show similar patterns.

You can google "Pre-processing techniques for classification of mixed data" or similar.

The main idea is to convert the timestamp into categorical variables and then binarize them, so that each category becomes its own 0/1 column. The year gets its indicator columns, the month gets 12 (January would be 1,0,0,0,0,0,0,0,0,0,0,0), and so on. You could also group the months into seasons, which gives 4 indicator variables instead of 12. You need to understand which of these variables is really correlated with the expected output, though.
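As a rough illustration of the binarizing idea applied to the question's data, here is a minimal sketch using only the Python standard library. Several things in it are my assumptions rather than anything from the question: the choice of four 6-hour time-of-day bins (`N_BINS`), the helper names `time_of_day_bin`, `one_hot` and `min_max_scale`, and the min-max scaling of duration so that it is on a scale comparable to the 0/1 columns. Whether this encoding suits your data is something you would need to check.

```python
from datetime import datetime

# Sample rows copied from the question: (timestamp, duration)
rows = [
    ("2015-10-01 00:00:08", 2912.0),
    ("2015-10-01 00:48:58", 30.0),
    ("2015-10-01 00:49:58", 229.0),
    ("2015-10-01 00:54:07", 4122.0),
    ("2015-10-01 02:03:19", 0.0),
]

N_BINS = 4  # assumption: four 6-hour blocks per day


def time_of_day_bin(timestamp, n_bins=N_BINS):
    """Map a 'YYYY-MM-DD HH:MM:SS' string to a time-of-day bin index."""
    t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    seconds = t.hour * 3600 + t.minute * 60 + t.second
    return seconds * n_bins // 86400  # 86400 seconds in a day


def one_hot(index, size):
    """Binarize a category index into a list of 0/1 indicators."""
    return [1 if i == index else 0 for i in range(size)]


def min_max_scale(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


scaled = min_max_scale([duration for _, duration in rows])

# One feature row per record: N_BINS indicator columns + scaled duration
features = [
    one_hot(time_of_day_bin(ts), N_BINS) + [d]
    for (ts, _), d in zip(rows, scaled)
]

for row in features:
    print(row)
```

The resulting `features` list is a plain numeric matrix, which is the kind of input that `sklearn.cluster.DBSCAN(...).fit(features)` accepts; `eps` and `min_samples` would still have to be tuned for the real data.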

Hope it helps!
