I have a month's worth of data that is in the form of:
             timestamp  duration
0  2015-10-01 00:00:08    2912.0
1  2015-10-01 00:48:58      30.0
2  2015-10-01 00:49:58     229.0
3  2015-10-01 00:54:07    4122.0
4  2015-10-01 02:03:19       0.0
...
I want to cluster this data on two dimensions, the time of day (HH:MM:SS) and the duration, using DBSCAN from the scikit-learn library.
I understand that a preprocessing step is needed before clustering, but I do not know which one to use.
I would appreciate it if someone could point me in the right direction.
Thank you!
Here is a rough answer: I am in a similar situation with a classification problem. Results-wise, classification algorithms are not so different from clustering ones, since the goal in both cases is to group samples by similar patterns.
You can google "Pre-processing techniques for classification of mixed data" or similar.
The main idea is to convert the timestamp into categorical variables and then binarize them (one-hot encoding), so you would have year: 1,0,1,1,1, etc., and month: 1,0,0,0,0,0,0,... (12 binary variables, shown here for January). You could also group the months into seasons, so you would have 4 seasons, i.e., 4 independent binary variables instead of 12. You need to understand which of these variables is really correlated with the expected output, though.
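As a minimal sketch of that binarization, assuming pandas is available: the rows below are copied from the question, while the `hour` feature and the `SEASONS` mapping (Northern Hemisphere, meteorological seasons) are my own assumptions for illustration and are not part of the original data.

```python
import pandas as pd

# Sample rows copied from the question; column names follow its header.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2015-10-01 00:00:08",
        "2015-10-01 00:48:58",
        "2015-10-01 00:49:58",
        "2015-10-01 00:54:07",
        "2015-10-01 02:03:19",
    ]),
    "duration": [2912.0, 30.0, 229.0, 4122.0, 0.0],
})

# Assumption: hour of day is used as the categorical "time of day" feature.
df["hour"] = df["timestamp"].dt.hour

# Assumption: hypothetical month-to-season mapping (Northern Hemisphere).
SEASONS = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}
df["season"] = df["timestamp"].dt.month.map(SEASONS)

# Binarize the categorical columns: one 0/1 column per observed category.
encoded = pd.get_dummies(
    df[["duration", "hour", "season"]],
    columns=["hour", "season"],
    dtype=int,
)
print(encoded)
```

Note that `pd.get_dummies` only creates columns for categories that actually occur in the data, so on a single month of data the season (and month) columns carry no information. Whether hour-level bins are the right granularity for your problem is something you would have to check.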
Hope it helps!