
Finding daily patterns with machine learning

I have created a huge log of daily activity in the format [timestamp, location]. For example:

[{1365650747255, 'san francisco'},
 {1365650743354, 'san francisco'},
 {1365650741349, 'san mateo'},
 {1365650756324, 'mountain view'},
 ...
 {1365650813354, 'menlo park'}]

What are the ways I can mine this information to find patterns like:

  • "On Sunday evenings, it's probable that I am near San Francisco"
  • "On Monday afternoons, it's probable that I am near Menlo Park"

The problem is that:

  • The dataset is huge.
  • It looks impossible to judge the date/time/day by applying a function to the timestamp value (unless we decode the timestamp into date-time values).

I do not see your problem here. Since the timestamp counts the time elapsed since the epoch (in milliseconds, judging by your sample values), you only have to apply the modulo operator with the length of the period of interest, for example one week. If you train a classifier on that, you should be able to predict every upcoming place. The main problem is not performance, as the learning is only done now and then, but how to update the learned dataset. As already stated, you do not have to use machine learning for this. However, if you want to, it can basically be done with a k-nearest-neighbor classifier on your 1-D dataset.
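As a rough illustration of the modulo idea, the sketch below folds a millisecond timestamp onto a single week. It assumes the timestamps are UTC milliseconds; the helper name `offset_in_week_ms` is made up for this example. Because the Unix epoch fell on a Thursday, a remainder of 0 corresponds to Thursday 00:00 UTC.

```python
from datetime import datetime, timezone

MS_PER_HOUR = 60 * 60 * 1000
MS_PER_DAY = 24 * MS_PER_HOUR
MS_PER_WEEK = 7 * MS_PER_DAY


def offset_in_week_ms(timestamp_ms):
    """Milliseconds elapsed since the most recent Thursday 00:00 UTC."""
    return timestamp_ms % MS_PER_WEEK


ts = 1365650747255  # first sample value from the question
offset = offset_in_week_ms(ts)

# Split the offset into whole days after Thursday and the hour of that day
days_after_thursday, rest = divmod(offset, MS_PER_DAY)
hour_of_day = rest // MS_PER_HOUR
print(days_after_thursday, hour_of_day)

# Cross-check against a decoded datetime (weekday() == 3 means Thursday)
decoded = datetime.fromtimestamp(ts / 1000, tz=timezone.utc)
print(decoded.weekday(), decoded.hour)
```

Two timestamps that fall at the same point of different weeks end up with the same offset, which is what lets a classifier treat them as neighbors.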

[EDIT]: Mixed up languages but fixed it: a classifier is the algorithm which does the statistical classification.

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. [1]

As I have only used sklearn for such things, the following is a minimal example of how you could use a k-nearest-neighbor classifier [2]. To be able to classify, you have to change the strings into numbers, then train your classifier on the given training dataset; afterwards you can predict the location for a new timestamp.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier


data = [[1365650747255, 'san francisco'],
        [1365650743354, 'san francisco'],
        [1365650741349, 'san mateo'],
        [1365650756324, 'mountain view'],
        # ...
        [1365650813354, 'menlo park']]

# Map location strings to integers and replace them in place
location_mapping = {}
location_index = 0
for index, (time, location) in enumerate(data):
    if location not in location_mapping:
        location_mapping[location] = location_index
        location_index += 1

    data[index][1] = location_mapping[location]

inverse_location_mapping = {value: key for key, value in location_mapping.items()}

data = np.array(data)
# The sample timestamps are in milliseconds, so one week is expressed in milliseconds
week = 60 * 60 * 24 * 7 * 1000

# Setup classifier (n_neighbors must not exceed the number of training samples)
classifier = KNeighborsClassifier(n_neighbors=10)

# Train classifier on given data; scikit-learn expects a 2-D feature array
classifier.fit((data[:, 0] % week).reshape(-1, 1), data[:, 1])

# Predict desired location; predict() returns an array with one label per sample
prediction = classifier.predict([[1365444444444 % week]])
print(inverse_location_mapping[prediction[0]])

[1]: http://en.wikipedia.org/wiki/Statistical_classification

[2]: http://scikit-learn.org/dev/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

The performance of this solution depends on how granular your requirement for pattern recognition is. Let's assume your requirement is dividing the day into 4 parts: Morning, Noon, Evening, Night. Let's call them time_slots.

Now let's take a look at how big your daily activity log is: 1 year, 2 years, 3 years?

Let's assume it is 1 year.

So we have a total of 365 * 4 = 1460 time_slots to monitor.

Now, create a simple map based on timestamps for each time_slot. E.g. it begins at T1 and ends at T2 (where T1 and T2 are timestamps like 1365650813354).

Based on the timestamp value in your log, it is easy to find its time_slot, i.e. Evening of 28th January, or Morning of 30th January.

You will have to store the time_slot vs place_i_was data in any suitable database with a proper schema. That depends on the kind of querying and analysis you want.

This way you will not need to run formulas on your dataset, and the predefined map/database lookup will serve your purpose.
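The time_slot lookup described above could be sketched roughly as follows. This is only an illustration under assumptions not in the answer: timestamps are treated as UTC milliseconds, the day is cut into four equal 6-hour slots, and an in-memory dictionary stands in for the database. The names `time_slot` and `build_slot_table` are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone

# Assumed slot boundaries: 00-06 Night, 06-12 Morning, 12-18 Noon, 18-24 Evening
SLOT_NAMES = ["Night", "Morning", "Noon", "Evening"]


def time_slot(timestamp_ms):
    """Return (ISO date, slot name) for a millisecond timestamp."""
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return moment.date().isoformat(), SLOT_NAMES[moment.hour // 6]


def build_slot_table(log):
    """Count how often each place was seen in each time_slot."""
    table = defaultdict(Counter)
    for timestamp_ms, place in log:
        table[time_slot(timestamp_ms)][place] += 1
    return table


log = [(1365650747255, "san francisco"),
       (1365650743354, "san francisco"),
       (1365650741349, "san mateo"),
       (1365650756324, "mountain view"),
       (1365650813354, "menlo park")]

table = build_slot_table(log)
slot = time_slot(1365650747255)
print(slot, table[slot].most_common(1))
```

In a real setup the `table` dictionary would be replaced by a database table keyed on the slot, with the schema chosen to match the queries you need.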

I am not sure these questions require machine learning; you can use regular statistics for that. I.e. build a probability distribution plot where x is the time of day and y is the probability that the location is San Francisco. Calculate the probability of San Francisco if the time is between a and b...
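One possible way to get the values for such a plot is to count, for every hour of the day, how many records there are and how many of them are at the place of interest. The sketch below assumes UTC millisecond timestamps and hourly bins; `hourly_probability` is a name invented for this example.

```python
from datetime import datetime, timezone


def hourly_probability(log, place):
    """Return 24 values: the share of records at `place` for each hour of day.

    Hours without any records yield None instead of a probability.
    """
    totals = [0] * 24
    hits = [0] * 24
    for timestamp_ms, location in log:
        hour = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc).hour
        totals[hour] += 1
        if location == place:
            hits[hour] += 1
    return [h / t if t else None for h, t in zip(hits, totals)]


log = [(1365650747255, "san francisco"),
       (1365650743354, "san francisco"),
       (1365650741349, "san mateo"),
       (1365650756324, "mountain view"),
       (1365650813354, "menlo park")]

curve = hourly_probability(log, "san francisco")
print(curve[3])
```

The resulting list can be passed directly to a plotting library as the y values, with the hours 0 to 23 on the x axis.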


This is how to load your data into a pandas DataFrame:

from __future__ import print_function, division
import pandas as pd
import datetime

df = pd.read_csv("data.csv",
                 names=["timestamp","location"],
                 parse_dates=["timestamp"],
                 date_parser=lambda x:datetime.datetime.fromtimestamp(int(x) / 1000))
print(df.head())

Outputs:

                    timestamp          location
0  2013-04-11 04:25:47.255000   "san francisco"
1  2013-04-11 04:25:43.354000   "san francisco"
2  2013-04-11 04:25:41.349000       "san mateo"
3  2013-04-11 04:25:56.324000   "mountain view"
4  2013-04-11 04:26:53.354000      "menlo park"
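Recent pandas releases deprecate the `date_parser` argument, so an alternative worth considering is to read the raw integers first and convert them afterwards with `pd.to_datetime`. The sketch below reads from an in-memory string instead of data.csv so that it is self-contained, and it converts to UTC rather than local time, so the hours will differ from the output shown above depending on your time zone.

```python
import io

import pandas as pd

# Stand-in for data.csv, using the sample rows from the question
csv_text = """1365650747255,san francisco
1365650743354,san francisco
1365650741349,san mateo
1365650756324,mountain view
1365650813354,menlo park
"""

df = pd.read_csv(io.StringIO(csv_text), names=["timestamp", "location"])

# Interpret the integers as milliseconds since the epoch, in UTC
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms", utc=True)
print(df.head())
```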

Convert the timestamps into tokens: "sunday morning".

Then do association rule mining to obtain rules such as

night => home
sunday morning => running in the park

where you only keep those rules in which the desired locations occur on the right-hand side.
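A very reduced version of this idea, with a single token on the left and a single location on the right, can be sketched with plain counting. This is not a full association rule miner such as Apriori; it only computes the confidence of each "token => location" rule. The token vocabulary, the 6-hour day parts, the UTC assumption and the names `to_token` and `mine_rules` are all choices made for this example.

```python
from collections import Counter
from datetime import datetime, timezone

DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]
PARTS = ["night", "morning", "afternoon", "evening"]  # assumed 6-hour parts


def to_token(timestamp_ms):
    """Turn a millisecond timestamp into a token such as 'sunday morning'."""
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{DAYS[moment.weekday()]} {PARTS[moment.hour // 6]}"


def mine_rules(log, min_confidence=0.5):
    """Return {(token, location): confidence} for rules above the threshold."""
    token_counts = Counter()
    pair_counts = Counter()
    for timestamp_ms, location in log:
        token = to_token(timestamp_ms)
        token_counts[token] += 1
        pair_counts[(token, location)] += 1
    rules = {}
    for (token, location), count in pair_counts.items():
        confidence = count / token_counts[token]
        if confidence >= min_confidence:
            rules[(token, location)] = confidence
    return rules


log = [(1365650747255, "san francisco"),
       (1365650743354, "san francisco"),
       (1365650741349, "san mateo"),
       (1365650756324, "mountain view"),
       (1365650813354, "menlo park")]

for (token, location), confidence in mine_rules(log, min_confidence=0.3).items():
    print(f"{token} => {location} ({confidence:.2f})")
```

Since the location is always placed on the right-hand side here, the filtering step mentioned above is built into how the rules are generated.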

First, convert the timestamp value to year-month-weekday. Replace the timestamp column with 3 columns corresponding to year, month and weekday.

Later you could simply group by a certain range of values for the dates and count the number of instances for each location.
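With pandas, these two steps might look like the sketch below. It assumes UTC millisecond timestamps and groups by weekday only; grouping by month or by a combination of columns works the same way.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": [1365650747255, 1365650743354, 1365650741349,
                  1365650756324, 1365650813354],
    "location": ["san francisco", "san francisco", "san mateo",
                 "mountain view", "menlo park"],
})

# Replace the timestamp column with year, month and weekday columns
moments = pd.to_datetime(df["timestamp"], unit="ms", utc=True)
df["year"] = moments.dt.year
df["month"] = moments.dt.month
df["weekday"] = moments.dt.day_name()
df = df.drop(columns="timestamp")

# Count the number of records per weekday and location
counts = df.groupby(["weekday", "location"]).size()
print(counts)
```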
