
K-Means clustering in R error

I have a dataset that I have created in R. It is structured as follows:

> head(btc_data)
           Date btc_close eth_close vix_close gold_close DEXCHUS change
1647 2010-07-18      0.09        NA        NA         NA      NA      0
1648 2010-07-19      0.08        NA     25.97    115.730      NA     -1
1649 2010-07-20      0.07        NA     23.93    116.650      NA     -1
1650 2010-07-21      0.08        NA     25.64    115.850      NA      1
1651 2010-07-22      0.05        NA     24.63    116.863      NA     -1
1652 2010-07-23      0.06        NA     23.47    116.090      NA      1

I am trying to cluster the observations using k-means. However, I get the following error message:

> km <- kmeans(trainingDS, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion 

What does this mean? Am I preprocessing the data incorrectly? What can I do to fix it? I can't drop the NAs because, out of 4500 initial observations, keeping only the complete cases leaves me with just 100.

Essentially, I am hoping that 3 clusters will form based on the change column, which takes the values -1, 0 and 1. I then wish to analyze the components of each cluster to find the strongest predictors of change. Which other algorithms would be most useful for doing this?

I also tried to remove all the NA values using the following code, but I still get the same error message:

> complete_cases <- btc_data[complete.cases(btc_data), ]
> km <- kmeans(complete_cases, 3, nstart = 20)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion

> sum(!sapply(btc_data, is.finite)) 
[1] 8008
> sum(sapply(btc_data, is.nan))
[1] 0
> 
> sum(!sapply(complete_cases, is.finite)) 
[1] 0
> sum(sapply(complete_cases, is.nan))
[1] 0

Here is the format of the data:

> sapply(btc_data, class)
      Date  btc_close  eth_close  vix_close gold_close    DEXCHUS     change 
    "Date"  "numeric"  "numeric"  "numeric"  "numeric"  "numeric"   "factor" 

There are several possible reasons for this error message, in particular invalid values (NA, NaN, Inf) or date columns in the data. Let's go through them:

But first, let's check that it works with the mtcars dataset since I will be using it:

kmeans(mtcars, 3)
K-means clustering with 3 clusters of sizes 9, 7, 16
--- lengthy output omitted

Likely problem 1: invalid values (NA/NaN/Inf)

df <- mtcars
df[1,1] <- NA
kmeans(df, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

df[1,1] <- Inf
kmeans(df, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

df[1,1] <- NaN
kmeans(df, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

You can check for these values using the following:

df[1:3,1] <- c(NA, Inf, NaN) # one NA, one Inf, one NaN
sum(sapply(df, is.na))
[1] 2
sum(sapply(df, is.infinite))
[1] 1
sum(sapply(df, is.nan))
[1] 1

To get rid of these, we can remove the corresponding observations. But note that complete.cases does not remove Inf:

complete_df <- df[complete.cases(df),]
sum(sapply(complete_df, is.infinite))
[1] 1

Instead, keep only the rows in which every value is finite, e.g.:

df[apply(sapply(df, is.finite), 1, all),]

You can also reassign these values or impute them, but this is a whole different procedure.
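
As a rough sketch of the imputation route, assuming that replacing each non-finite value with its column's median is acceptable for your data (that is an assumption, and it is questionable for long runs of missing values such as a price series that did not exist yet). The helper name impute_median is made up for this example and is not part of base R:

```r
# Sketch: replace non-finite values in numeric columns with the column median.
# `impute_median` is a hypothetical helper, not a base R function.
impute_median <- function(data) {
  numeric_cols <- sapply(data, is.numeric)
  data[numeric_cols] <- lapply(data[numeric_cols], function(col) {
    bad <- !is.finite(col)            # TRUE for NA, NaN, Inf and -Inf
    col[bad] <- median(col[!bad])     # median of the finite values only
    col
  })
  data
}

df <- mtcars
df[1:3, 1] <- c(NA, Inf, NaN)         # same test values as above

imputed <- impute_median(df)
sum(!sapply(imputed, is.finite))      # counts the remaining non-finite values
km <- kmeans(imputed, 3)              # no longer stops with the NA/NaN/Inf error
```

Unlike dropping rows, this keeps all observations, at the price of pulling the imputed points towards the middle of each column.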

Likely problem 2: dates. See the following:

library(lubridate)
df <- mtcars
df$date <- seq.Date(from=ymd("1990-01-01"), length.out = nrow(df), by=1)
kmeans(df, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In kmeans(df, 3) : NAs introduced by coercion
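
The "NAs introduced by coercion" warning hints at what is going on. The sketch below assumes that kmeans converts its input with as.matrix before coercing it to double, which is what the warnings suggest. A data frame containing a Date column becomes a character matrix, and the date strings cannot be read back as numbers:

```r
df <- mtcars
df$date <- seq(as.Date("1990-01-01"), by = "day", length.out = nrow(df))

m <- as.matrix(df)   # one Date column is enough to make the whole matrix text
typeof(m)            # "character"

# The date strings (e.g. "1990-01-01") are not valid numbers, so they turn into NA
date_as_number <- suppressWarnings(as.numeric(m[, "date"]))
all(is.na(date_as_number))   # TRUE
```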

You can get around this problem by excluding the dates or by converting them to something else, e.g.:

df$newdate <- seq_along(df$date)
df$date <- NULL
kmeans(df, 3)
K-means clustering with 3 clusters of sizes 9, 7, 16
---- lengthy output omitted

Or you can coerce the dates to numeric yourself before you pass them to kmeans:

df <- mtcars
df$date <- seq.Date(from=ymd("1990-01-01"), length.out = nrow(df), by=1)
df$date <- as.numeric(df$date)
kmeans(df, 3)
K-means clustering with 3 clusters of sizes 9, 16, 7
--- lengthy output omitted

Check the data type of the variables you are clustering on. The error most probably arises because one of them is non-numeric. Also handle date columns properly before you cluster.
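
One way to make that check concrete is sketched below. It uses mtcars with an added Date column and an added factor column as a stand-in for the real data; the column names date and gear_f are made up for the example:

```r
df <- mtcars
df$date   <- as.Date("1990-01-01") + seq_len(nrow(df)) - 1  # a Date column
df$gear_f <- factor(df$gear)                                # a factor column

sapply(df, class)                  # inspect the class of every column

is_num <- sapply(df, is.numeric)   # FALSE for the Date and the factor column
numeric_only <- df[, is_num, drop = FALSE]

km <- kmeans(numeric_only, 3)      # clusters on the numeric columns only
```

Dropping the non-numeric columns is only one option. Whether a date or a factor should instead be converted to a number depends on what the clustering is meant to capture.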

Did you use the "Date" column in the clustering?

k-means clustering requires numeric data.

Try this:

btc_data$Date = as.numeric(gsub("-", "", as.character(btc_data$Date)))
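
Note that the choice of numeric encoding changes the distances between dates, and k-means works on Euclidean distances. A small comparison of the digit-string encoding above with a plain as.numeric on the Date (days since 1970-01-01); the two sample dates are arbitrary:

```r
dates <- as.Date(c("2010-07-31", "2010-08-01"))   # two consecutive days

as_days   <- as.numeric(dates)                               # days since 1970-01-01
as_digits <- as.numeric(gsub("-", "", as.character(dates)))  # 20100731, 20100801

diff(as_days)     # 1: consecutive days stay one unit apart
diff(as_digits)   # 70: the same one-day gap looks larger across a month boundary
```

Either encoding avoids the error. Which one suits the analysis is a separate question.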


 