
Can we cluster a multivariate time series dataset in Python?

I have a dataset with many financial signal values for different stocks at different times. For example:

StockName  Date   Signal1  Signal2
----------------------------------
Stock1     1/1/20    a       b
Stock1     1/2/20    c       d
.
.
.
Stock2     1/1/20    e       f
Stock2     1/2/20    g       h
.
.
.

I would like to build a time series table like the one below and cluster the stocks based on both Signal1 and Signal2 (two variables):

StockName   1/1/20    1/2/20    ........    Cluster#
----------------------------------------------------
 Stock1     [a,b]      [c,d]                    0
 Stock2     [e,f]      [g,h]                    1
 Stock3     ......     .....                    0
 .
 .
 .

1) Is there a way to do this (clustering stocks based on multiple variables of the time series data)? I searched online, but everything I found is about clustering time series based on a single variable.

2) Also, is there a way to cluster different stocks at different times as well? (So Stock1 at time 1 might be in the same cluster as Stock2 at time 3.)

I am revising my answer here, based on the new information you last posted.

import datetime
import warnings

import matplotlib.pyplot as plt
import pandas as pd

warnings.filterwarnings("ignore")

# Note: The purpose of this section (3. The Data) is to show the data preprocessing and to give rationale for using different sources of data, hence I will only use a subset of the full data (that is used for training).

def parser(x):
    return datetime.datetime.strptime(x,'%Y-%m-%d')

# dataset_ex_df = pd.read_csv('data/panel_data_close.csv', header=0, parse_dates=[0], date_parser=parser)


import yfinance as yf

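# Get the data for Goldman Sachs (ticker GS)
start = '2018-01-01'
end = '2020-04-22'

# Newer yfinance versions may adjust prices by default and omit 'Adj Close';
# pass auto_adjust=False if that column is missing.
data = yf.download('GS', start=start, end=end)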

data = data.reset_index()
data


data.dtypes

# Rename the 'Adj Close' column to 'Adj_Close'
data = data.rename(columns={"Adj Close": "Adj_Close"})
data

num_training_days = int(data.shape[0]*.7)
print('Number of training days: {}. Number of test days: {}.'.format(num_training_days, data.shape[0]-num_training_days))



# TECHNICAL INDICATORS
#def get_technical_indicators(dataset):
# Create 7 and 21 days Moving Average
data['ma7'] = data['Adj_Close'].rolling(window=7).mean()
data['ma21'] = data['Adj_Close'].rolling(window=21).mean()


# Create exponential weighted moving average
data['26ema'] = data['Adj_Close'].ewm(span=26).mean()
data['12ema'] = data['Adj_Close'].ewm(span=12).mean()
data['MACD'] = (data['12ema']-data['26ema'])

# Create Bollinger Bands
data['20sd'] = data['Adj_Close'].rolling(window=20).std() 
data['upper_band'] = data['ma21'] + (data['20sd']*2)
data['lower_band'] = data['ma21'] - (data['20sd']*2)

# Create Exponential moving average
data['ema'] = data['Adj_Close'].ewm(com=0.5).mean()

# Create momentum (1-day price change)
data['momentum'] = data['Adj_Close'].diff(1)



dataset_TI_df = data
dataset = data


def plot_technical_indicators(dataset, last_days):
    plt.figure(figsize=(16, 10), dpi=100)
    shape_0 = dataset.shape[0]
    xmacd_ = shape_0-last_days

    dataset = dataset.iloc[-last_days:, :]
    x_ = list(dataset.index)

    # Plot first subplot
    plt.subplot(2, 1, 1)
    plt.plot(dataset['ma7'],label='MA 7', color='g',linestyle='--')
    plt.plot(dataset['Adj_Close'],label='Closing Price', color='b')
    plt.plot(dataset['ma21'],label='MA 21', color='r',linestyle='--')
    plt.plot(dataset['upper_band'],label='Upper Band', color='c')
    plt.plot(dataset['lower_band'],label='Lower Band', color='c')
    plt.fill_between(x_, dataset['lower_band'], dataset['upper_band'], alpha=0.35)
    plt.title('Technical indicators for Goldman Sachs - last {} days.'.format(last_days))
    plt.ylabel('USD')
    plt.legend()

    # Plot second subplot
    plt.subplot(2, 1, 2)
    plt.title('MACD')
    plt.plot(dataset['MACD'],label='MACD', linestyle='-.')
    plt.hlines(15, xmacd_, shape_0, colors='g', linestyles='--')
    plt.hlines(-15, xmacd_, shape_0, colors='g', linestyles='--')
    # plt.plot(dataset['log_momentum'],label='Momentum', color='b',linestyle='-')

    plt.legend()
    plt.show()

plot_technical_indicators(dataset_TI_df, 400)


This gives you some signals to work with. Of course, the features can be anything you want. Note that this is technical analysis, not fundamental analysis. At this point you can do your clustering, or whatever else you want, on these features.
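One way to approach the first part of the question is to pivot the long table into one row per stock, with a column for every (signal, date) pair, and then run an ordinary clustering algorithm on those rows. The sketch below assumes the column names from the question (StockName, Date, Signal1, Signal2). The data values are synthetic, and the choice of scikit-learn's KMeans with two clusters is an assumption made for illustration, not something taken from the original post.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic long-format data standing in for the question's table
# (stock names, dates, and signal values are made up for illustration).
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=30, freq="D")
rows = []
for i in range(6):
    level = 0.0 if i < 3 else 5.0  # two artificial groups of stocks
    for d in dates:
        rows.append({
            "StockName": f"Stock{i + 1}",
            "Date": d,
            "Signal1": level + rng.normal(),
            "Signal2": -level + rng.normal(),
        })
long_df = pd.DataFrame(rows)

# Wide layout: one row per stock, one column per (signal, date) pair.
wide = long_df.pivot(index="StockName", columns="Date",
                     values=["Signal1", "Signal2"])
wide = wide.dropna()  # drop stocks with missing dates (imputing is another option)

# Standardize each column so that neither signal dominates the distance.
X = StandardScaler().fit_transform(wide.to_numpy())

# Assign every stock to one of two clusters (the cluster count is a guess).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
stock_clusters = pd.Series(kmeans.fit_predict(X), index=wide.index, name="Cluster")
print(stock_clusters)
```

Plain Euclidean k-means compares the series date by date. If similar patterns can occur shifted in time, a time series specific method (for example the DTW-based k-means in the tslearn package) may be a better fit.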

Here is a good link on clustering:

https://www.pythonforfinance.net/2018/02/08/stock-clusters-using-k-means-algorithm-in-python/

Good material to read (title: Time Series Clustering and Dimensionality Reduction):

https://towardsdatascience.com/time-series-clustering-and-dimensionality-reduction-5b3b4e84f6a3
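For the second part of the question (Stock1 at time 1 landing in the same cluster as Stock2 at time 3), one possible approach is to treat every (stock, date) row as its own observation and cluster directly on the signal values. This is a minimal sketch with synthetic data; the column names follow the question, while the regime-switch setup, KMeans, and the cluster count are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data (made-up values): each stock switches from a "low" to a
# "high" regime on a different day.
rng = np.random.default_rng(1)
dates = pd.date_range("2020-01-01", periods=20, freq="D")
rows = []
for i, switch_day in enumerate([5, 10, 15]):
    for t, d in enumerate(dates):
        level = 0.0 if t < switch_day else 6.0
        rows.append({
            "StockName": f"Stock{i + 1}",
            "Date": d,
            "Signal1": level + rng.normal(),
            "Signal2": level + rng.normal(),
        })
obs = pd.DataFrame(rows)

# Every row is one observation described by its two signals.
X = StandardScaler().fit_transform(obs[["Signal1", "Signal2"]])
obs["Cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Stock-by-date table of cluster labels, similar to the table in the question.
regimes = obs.pivot(index="StockName", columns="Date", values="Cluster")
print(regimes)
```

Here the cluster label depends only on the signal values of a row, so the same stock can move between clusters over time. To take recent history into account, lagged or rolling features could be added as extra columns before clustering.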
