
kmeans returns an error for my time-series data sets in R

I have a time-series data set. The data are available here in Excel format. I want to cluster the data with k-means, but I get an error.

Note: FinDat is my data, loaded from the attached source.

  > head(FinDat)
# A tibble: 6 x 10
  date                 ISE...2  ISE...3       SP      DAX     FTSE   NIKKEI  BOVESPA       EU
  <dttm>                 <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 2009-01-05 00:00:00  0.0358   0.0384  -0.00468  0.00219  3.89e-3  0        0.0312   0.0127 
2 2009-01-06 00:00:00  0.0254   0.0318   0.00779  0.00846  1.29e-2  0.00416  0.0189   0.0113 
3 2009-01-07 00:00:00 -0.0289  -0.0264  -0.0305  -0.0178  -2.87e-2  0.0173  -0.0359  -0.0171 
4 2009-01-08 00:00:00 -0.0622  -0.0847   0.00339 -0.0117  -4.66e-4 -0.0401   0.0283  -0.00556
5 2009-01-09 00:00:00  0.00986  0.00966 -0.0215  -0.0199  -1.27e-2 -0.00447 -0.00976 -0.0110 
6 2009-01-12 00:00:00 -0.0292  -0.0424  -0.0228  -0.0135  -5.03e-3 -0.0490  -0.0538  -0.0125 
# ... with 1 more variable: EM <dbl>

silhouette_score <- function(k){
  km <- kmeans(FinDat, centers = k, nstart=25)
  ss <- silhouette(km$cluster, dist(FinDat))
  mean(ss[, 3])
}
k <- 2:10
avg_sil <- sapply(k, silhouette_score)

This returns:

        Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
    In addition: Warning message:
    In storage.mode(x) <- "double" : NAs introduced by coercion

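It seems kmeans cannot handle the date column: it coerces its input to a numeric matrix, and the date values turn into NA in that step (hence the "NAs introduced by coercion" warning). You may want to exclude that column.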

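library(cluster)  # provides silhouette()
silhouette_score <- function(k) {
  # k-means needs fewer clusters than observations
  stopifnot(k <= nrow(FinDat) - 1)
  # FinDat[-1] drops the first (date) column, leaving only numeric columns
  km <- kmeans(FinDat[-1], centers=k, nstart=25)
  ss <- silhouette(km$cluster, dist(FinDat[-1]))
  # mean silhouette width, named by k
  return(setNames(mean(ss[, 3]), k))
}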

k <- 2:5
avg_sil <- sapply(k, silhouette_score)
avg_sil
#         2         3         4         5 
# 0.3791762 0.3302388 0.2735529 0.2133566 
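
To see where the NA values come from, the coercion can be reproduced by hand. The sketch below is only an illustration: `df` is a small stand-in built from the first three rows of the question's data (not the full FinDat), and `m`, `m_num`, `num_cols` and `X` are names made up for this example. It also shows one way to pick the numeric columns by type rather than by position, which may be safer if the date is not always the first column.

```r
# Stand-in data: first three rows of date, SP and DAX from the question
df <- data.frame(
  date = as.POSIXct(c("2009-01-05", "2009-01-06", "2009-01-07"), tz = "UTC"),
  SP   = c(-0.00468, 0.00779, -0.0305),
  DAX  = c(0.00219, 0.00846, -0.0178)
)

# as.matrix() falls back to a character matrix when a column is not numeric
m <- as.matrix(df)
is.character(m)

# Forcing that matrix to double turns the date strings into NA
m_num <- suppressWarnings(matrix(as.numeric(m), nrow = nrow(m)))
anyNA(m_num[, 1])

# Select columns by type instead of relying on the date being column 1
num_cols <- vapply(df, is.numeric, logical(1))
X <- df[num_cols]
names(X)
```

With the real data, `FinDat[vapply(FinDat, is.numeric, logical(1))]` should give the same columns as `FinDat[-1]`, assuming the date is the only non-numeric column.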

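Alternatively, use data.matrix to convert all columns to numeric. Note that the date column then takes part in the clustering as a number (seconds since the epoch for POSIXct).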

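silhouette_score2 <- function(k) {
  # k-means needs fewer clusters than observations
  stopifnot(k <= nrow(FinDat) - 1)
  # local numeric copy; the global FinDat is left unchanged
  FinDat <- data.matrix(FinDat)
  km <- kmeans(FinDat, centers=k, nstart=25)
  ss <- silhouette(km$cluster, dist(FinDat))
  # mean silhouette width, named by k
  return(setNames(mean(ss[, 3]), k))
}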

k <- 2:5
avg_sil <- sapply(k, silhouette_score2)
avg_sil
#          2          3          4          5 
# 0.40783229 0.37777778 0.21111111 0.08333333
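
One thing to watch with the data.matrix route: the converted dates are on a far larger scale than the returns, so they are likely to dominate the Euclidean distances. The sketch below illustrates this and one possible remedy, standardising the columns with `scale()` before clustering. That step is not part of the original answer. `df` is a stand-in built from three columns of the question's data, and the seed is arbitrary.

```r
# Stand-in data: date, SP and DAX for the six rows shown in the question
df <- data.frame(
  date = as.POSIXct(c("2009-01-05", "2009-01-06", "2009-01-07",
                      "2009-01-08", "2009-01-09", "2009-01-12"), tz = "UTC"),
  SP   = c(-0.00468, 0.00779, -0.0305, 0.00339, -0.0215, -0.0228),
  DAX  = c(0.00219, 0.00846, -0.0178, -0.0117, -0.0199, -0.0135)
)

X <- data.matrix(df)        # POSIXct becomes seconds since 1970-01-01
col_sd <- apply(X, 2, sd)   # spread of each column on its raw scale
col_sd

Xs <- scale(X)              # centre each column and divide by its sd
set.seed(1)                 # arbitrary seed, only for repeatability
km <- kmeans(Xs, centers = 2, nstart = 25)
km$cluster
```

Whether the date should be a clustering feature at all depends on the goal. If only the return patterns matter, dropping the column is probably the simpler choice.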

Data:

FinDat <- structure(list(date = structure(c(1231110000, 1231196400, 1231282800, 
1231369200, 1231455600, 1231714800), class = c("POSIXct", "POSIXt"
), tzone = ""), ISE...2 = c(0.0358, 0.0254, -0.0289, -0.0622, 
0.00986, -0.0292), ISE...3 = c(0.0384, 0.0318, -0.0264, -0.0847, 
0.00966, -0.0424), SP = c(-0.00468, 0.00779, -0.0305, 0.00339, 
-0.0215, -0.0228), DAX = c(0.00219, 0.00846, -0.0178, -0.0117, 
-0.0199, -0.0135), FTSE = c(0.00389, 0.0129, -0.0287, -0.000466, 
-0.0127, -0.00503), NIKKEI = c(0, 0.00416, 0.0173, -0.0401, -0.00447, 
-0.049), BOVESPA = c(0.0312, 0.0189, -0.0359, 0.0283, -0.00976, 
-0.0538), EU = c(0.0127, 0.0113, -0.0171, -0.00556, -0.011, -0.0125
)), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")
