
Time series trend recognition in Python

I have a CSV containing sales figures for various dates. Here is an example of the file:

DATE,       ARTICLENO, QUANTITY
2018-07-17, 101,       50
2018-07-16, 101,       55
2018-07-16, 105,       36
2018-07-15, 105,       23

I read this into a pandas DataFrame and ran a basic k-means algorithm on it, but I need more help.

Data description: The date column is the index of the DataFrame and gives the date of each sales value. There are multiple (Date, Quantity, ArticleNo) tuples, so there is one time series per article number. The series can have different lengths and starting dates, which makes predicting and recognizing trends (e.g. good sales in summer or winter) even harder. The CSV is sorted by ArticleNo and Date.

Goal:

Cluster a given set of data from a CSV, create labels for articles that sell well in summer or winter (seasonal trends), and match future articles to those labels.

Here is what I have done so far (the date is not the index yet, but that is the goal):

from __future__ import absolute_import, division, print_function
import pandas as pd
import numpy as np
from matplotlib import pyplot as plp
from sklearn import preprocessing
from sklearn.cluster import KMeans
import sys

def extract_articles(data, article_numbers):
    return pd.DataFrame(
        [
            data[data['ARTICLENO'] == article_no]['QUANTITY'].values
            for article_no in article_numbers
        ]
    ).fillna(0)


def read_csv_file(file_name, number_of_lines):
    return pd.read_csv(file_name, parse_dates=['DATE'],
                       nrows=number_of_lines)

def get_unique_article_numbers(data):
    return data['ARTICLENO'].unique()


def main():
    data = read_csv_file('statistic.csv', 400000)



    modeling_article_numbers = get_unique_article_numbers(data)
    print("Clustering on", len(modeling_article_numbers), "article numbers")
    modeling_data = extract_articles(data, modeling_article_numbers)
    modeling_data = modeling_data.iloc[:50, :]
    # 'switch' dataframe
    modeling_data = modeling_data.T
    modeling_data = modeling_data.pct_change().fillna(0)
    normalized_modeling_data = preprocessing.normalize(
        modeling_data, norm='l2', axis=0)
    print(modeling_data)


    predicting_article_numbers = [30079229, 30079854, 30086845]
    predicting_article_data = extract_articles(
        data, predicting_article_numbers)
    predicting_article_data = predicting_article_data.pct_change().fillna(0)
    normalized_predicting_article_data = preprocessing.normalize(
        predicting_article_data, norm='l2')


    kmeans = KMeans(n_clusters=5,
                    random_state=0).fit(normalized_modeling_data)
    print(kmeans.labels_)
    # for data, article_no in [
        # (normalized_predicting_article_data, 430079229),
        # (normalized_predicting_article_data, 430079854),
        # (modeling_data, 430074590),
        # ]:
    # print('Predicting article {0}'.format(article_no))
    # print(kmeans.predict([data[0]]))

    for i, cluster_center in enumerate(kmeans.cluster_centers_):
        plp.plot(cluster_center, label='Center {0}'.format(i))
    plp.legend(loc='best')
    plp.title('Cluster based on ' + str(len(modeling_article_numbers)) +
              ' article numbers')
    plp.show()


main()

I transposed the DataFrame because it did not contain the series for each article number along axis 1. My questions are: How can I get a 'description' of each label? Can I name them? Is k-means perhaps the wrong algorithm for what I want?

Have you tried making each article a row in your dataset?

After reading your question, I am not sure whether you did.

Once you have done that, you can aggregate your data over time, e.g. as quantity per week. If you have more than one year of data, use the average quantity per week. You then get a table with 52 features {week 1: sold 500; week 2: sold 520; ...} for every article.
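A minimal sketch of that aggregation with pandas, assuming the column names from the question's CSV. The helper name `weekly_profile` and the sample values are made up for illustration. ISO calendar weeks are used here, and since some ISO years have a 53rd week, the sketch produces 53 columns rather than exactly 52.

```python
import io

import pandas as pd

# Tiny made-up sample in the question's CSV layout (values are illustrative only).
CSV_TEXT = """DATE,ARTICLENO,QUANTITY
2018-07-17,101,50
2018-07-16,101,55
2018-01-03,101,5
2017-07-18,101,45
2018-07-16,105,36
2018-07-15,105,23
2018-12-20,105,80
"""


def weekly_profile(data):
    """Return one row per article and one column per ISO week (1..53).

    Each cell is the quantity sold in that week of the year,
    averaged over all years in which the article has sales in that week.
    """
    data = data.copy()
    iso = data["DATE"].dt.isocalendar()
    data["YEAR"] = iso["year"].astype(int)
    data["WEEK"] = iso["week"].astype(int)
    # Total quantity per article, year and week ...
    per_week = data.groupby(["ARTICLENO", "YEAR", "WEEK"])["QUANTITY"].sum()
    # ... then the mean over years for each article and week.
    mean_week = per_week.groupby(level=["ARTICLENO", "WEEK"]).mean()
    profile = mean_week.unstack("WEEK")
    # Weeks without any sales become 0 so every article has the same columns.
    return profile.reindex(columns=range(1, 54)).fillna(0)


data = pd.read_csv(io.StringIO(CSV_TEXT), parse_dates=["DATE"],
                   skipinitialspace=True)
features = weekly_profile(data)
print(features.shape)
```

Because every article ends up with the same columns, series with different lengths and starting dates become directly comparable. Treating a missing week as 0 is an assumption; if an article simply was not on sale yet, leaving the gap as NaN may be more honest.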

I don't think k-means is what you are looking for. You know pretty well what you want, which makes you a good "teacher" for your algorithm, so use supervised algorithms. For that you need to label at least some (ideally all) of your aggregated product data by hand, but it should be worth the work because of the better results.

You could also look into time series seasonality analysis / time series decomposition.
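To give an idea of what decomposition does, here is a simplified classical additive decomposition written with pandas only. It assumes a weekly series for a single article and a cycle of 52 weeks. The function name `decompose_additive` and the synthetic sales curve are made up for illustration. For real work, statsmodels provides `seasonal_decompose`, which handles the details more carefully.

```python
import numpy as np
import pandas as pd


def decompose_additive(series, period):
    """Simplified additive split: series = trend + seasonal + residual."""
    # Trend: centered moving average over one full period.
    trend = series.rolling(window=period, center=True,
                           min_periods=period).mean()
    detrended = series - trend
    # Seasonal part: average detrended value at each position in the cycle.
    position = np.arange(len(series)) % period
    seasonal_means = detrended.groupby(position).mean()
    seasonal_means = seasonal_means - seasonal_means.mean()
    seasonal = pd.Series(seasonal_means.to_numpy()[position],
                         index=series.index)
    residual = series - trend - seasonal
    return trend, seasonal, residual


# Made-up weekly sales for one article: flat level of 100 plus a yearly wave.
weeks = np.arange(156)
index = pd.date_range("2016-01-04", periods=156, freq="W-MON")
sales = pd.Series(100 + 30 * np.sin(2 * np.pi * weeks / 52), index=index)

trend, seasonal, residual = decompose_additive(sales, period=52)
peak_position = int(np.argmax(seasonal.to_numpy()[:52]))
print("seasonal peak at position", peak_position, "of the 52-week cycle")
```

The position of the seasonal peak is one possible basis for a readable label such as "summer" or "winter", although which weeks count as which season is a choice you would have to make yourself.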

Anyway, if you are familiar with scikit-learn, I would give the supervised algorithms (decision trees, random forest, SVM, MLPClassifier, ...) a chance; they might be much easier to get working.
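A rough sketch of the supervised route with a random forest, assuming each article has already been turned into a row of 52 weekly values as described above. Everything in the data is made up: the profiles are synthetic, the peak weeks are arbitrary, and the labels "summer" and "winter" stand in for whatever you assign by hand.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
weeks = np.arange(1, 53)


def synthetic_profile(peak_week):
    """Made-up weekly sales curve with a bump around peak_week plus noise."""
    # Circular distance in weeks, so week 52 and week 1 are neighbours.
    gap = np.abs(weeks - peak_week)
    distance = np.minimum(gap, 52 - gap)
    bump = 80 * np.exp(-(distance / 6.0) ** 2)
    return 20 + bump + rng.normal(0, 5, size=52)


def to_shape(profiles):
    """Scale each row by its total so the shape matters, not the volume."""
    return profiles / profiles.sum(axis=1, keepdims=True)


# Hand-labelled training set; in practice a person assigns these labels.
X_train = np.array(
    [synthetic_profile(28) for _ in range(20)]
    + [synthetic_profile(2) for _ in range(20)]
)
y_train = ["summer"] * 20 + ["winter"] * 20

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(to_shape(X_train), y_train)

# Two unseen articles, peaking in week 29 and week 1 respectively.
new_articles = np.array([synthetic_profile(29), synthetic_profile(1)])
predicted = model.predict(to_shape(new_articles))
print(predicted)
```

Unlike k-means cluster indices, the predicted classes carry the names you chose, which addresses the wish to name the labels. `model.predict_proba` can additionally show how confident the forest is about an article that fits neither season well.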

