简体   繁体   English

使用dplyr和/或cut将连续变量分为几类

[英]Breaking a continuous variable into categories using dplyr and/or cut

I have a dataset that is a record of price changes, among other variables. 我有一个记录价格变化以及其他变量的数据集。 I would like to mutate the price column into a categorical variable. 我想将价格列突变为分类变量。 I understand that the two functions of importance here in R seem to be dplyr and/or cut . 我了解R中此处重要的两个功能似乎是dplyr和/或cut

> head(btc_data)
                 time  btc_price
1 2017-08-27 22:50:00 4,389.6113
2 2017-08-27 22:51:00 4,389.0850
3 2017-08-27 22:52:00 4,388.8625
4 2017-08-27 22:53:00 4,389.7888
5 2017-08-27 22:56:00 4,389.9138
6 2017-08-27 22:57:00 4,390.1663


>dput(btc_data)
        ("4,972.0700", "4,972.1763", "4,972.6563", "4,972.9188", "4,972.9763", 
        "4,973.1575", "4,974.9038", "4,975.0913", "4,975.1738", "4,975.9325", 
        "4,976.0725", "4,976.1275", "4,976.1825", "4,976.1888", "4,979.0025", 
        "4,979.4800", "4,982.7375", "4,983.1813", "4,985.3438", "4,989.2075", 
        "4,989.7888", "4,990.1850", "4,991.4500", "4,991.6600", "4,992.5738", 
        "4,992.6900", "4,992.8025", "4,993.8388", "4,994.7013", "4,995.0788", 
        "4,995.8800", "4,996.3338", "4,996.4188", "4,996.6725", "4,996.7038", 
        "4,997.1538", "4,997.7375", "4,997.7750", "5,003.5150", "5,003.6288", 
        "5,003.9188", "5,004.2113", "5,005.1413", "5,005.2588", "5,007.2788", 
        "5,007.3125", "5,007.6788", "5,008.8600", "5,009.3975", "5,009.7175", 
        "5,010.8500", "5,011.4138", "5,011.9838", "5,013.1250", "5,013.4350", 
        "5,013.9075"), class = "factor")), .Names = c("time", "btc_price"
    ), class = "data.frame", row.names = c(NA, -10023L))

The difficulty is in the categories I want to create. 困难在于我要创建的类别。 The categories -1,0,1 should be based upon the % change over the previous time-lag. 类别-1,0,1应基于前一个时滞的%变化。

So for example, a 20% increase in price over the past 60 minutes would be labeled 1, otherwise 0. A 20% decrease in price over the past 60 minutes should be -1, otherwise 0. 因此,例如,在过去60分钟内价格上涨20%会被标记为1,否则为0。在过去60分钟内价格下跌20%应该是-1,否则为0。

Is this possible in R? 在R中这可能吗? What is the most efficient way to implement the change? 实施更改的最有效方法是什么?

There is a similar question here and also here but these do not answer my question for two reasons- 还有一个类似的问题在这里这里 ,但这些不回答我的问题有两个reasons-

a) I am trying to calculate % change, not simply the difference between 2 rows. a)我正在尝试计算%变化,而不仅仅是两个行之间的差异。

b) This calculation should be based on the max/min values for the rolling past time frame (ie- 20% decrease in the past hour = -1, 20% increase in the past hour = 1 b)此计算应基于过去时间范围内的最大值/最小值(即,过去一小时减少20%= -1,过去一小时增加20%= 1

Its always difficult to work with percentage. 用百分比总是很难工作。 You need to be aware that every thing is flexible: when you choose a reference which is a difference, a running mean, max or whatever - you have at least two variables on the side of the reference which you have to choose carefully. 您需要意识到每件事都是灵活的:当您选择一个有差异的参考,均值,最大值或其他参数时,您至少需要仔细选择两个变量。 The same thing with the value you want to set in relation to your reference. 与您要相对于参考设置的值相同。 Together this give you almost infinite possible how you can calculate your percentage. 这些共同给您几乎无限可能如何计算百分比。 Here is the key to your question. 这是您问题的关键。

# create the data

dat <- c("4,972.0700", "4,972.1763", "4,972.6563", "4,972.9188", "4,972.9763", 
         "4,973.1575", "4,974.9038", "4,975.0913", "4,975.1738", "4,975.9325", 
         "4,976.0725", "4,976.1275", "4,976.1825", "4,976.1888", "4,979.0025", 
         "4,979.4800", "4,982.7375", "4,983.1813", "4,985.3438", "4,989.2075", 
         "4,989.7888", "4,990.1850", "4,991.4500", "4,991.6600", "4,992.5738", 
         "4,992.6900", "4,992.8025", "4,993.8388", "4,994.7013", "4,995.0788", 
         "4,995.8800", "4,996.3338", "4,996.4188", "4,996.6725", "4,996.7038", 
         "4,997.1538", "4,997.7375", "4,997.7750", "5,003.5150", "5,003.6288", 
         "5,003.9188", "5,004.2113", "5,005.1413", "5,005.2588", "5,007.2788", 
         "5,007.3125", "5,007.6788", "5,008.8600", "5,009.3975", "5,009.7175", 
         "5,010.8500", "5,011.4138", "5,011.9838", "5,013.1250", "5,013.4350", 
         "5,013.9075")
dat <- as.numeric(gsub(",","",dat))

# calculate the difference to the last minute
dd <- diff(dat)

# calculate the running ratio to difference of the last minutes
interval = 20
out <- NULL
for(z in interval:length(dd)){
  out <- c(out, (dd[z] / mean(dd[(z-interval):z])))
}

# calculate the running ratio to price of the last minutes
out2 <- NULL
for(z in interval:length(dd)){
  out2 <- c(out2, (dat[z] / mean(dat[(z-interval):z])))
}

# build categories for difference-ratio
catego <- as.vector(cut(out, breaks=c(-Inf,0.8,1.2,Inf), labels=c(-1,0,1)))
catego <- c(rep(NA,interval+1), as.numeric(catego))


# plot
plot(dat, type="b", main="price orginal")
plot(dd, main="absolute difference to last minute", type="b")
plot(out, main=paste('difference to last minute, relative to "mean" of the last', interval, 'min'), type="b")
abline(h=c(0.8, 1.2), col="magenta")
plot(catego, main=paste("categories for", interval))
plot(out2, main=paste('price last minute, relative to "mean" of the last', interval, 'min'), type="b")

I think you search the way how to calculate the last plot ( price last minute, relative to "mean" of t... ) the value in this example vary between 1.0010 and 1.0025 so far away from what you expect with 0.8 and 1.2. 我认为您正在搜索如何计算最后一个图( price last minute, relative to "mean" of t... )的方式,此示例中的值在1.0010和1.0025之间变化,与您期望的0.8和1.2相差甚远。 You can make the difference bigger when you choose a bigger time interval than 20min maybe a week could be good (11340) but even with this high time value it will be difficult to achieve a value above 1.2. 当您选择大于20分钟的时间间隔时,您可以使差值更大,也许一周会比较好(11340),但是即使具有如此高的时间值,也很难实现高于1.2的值。 The problem is the high price of 5000 a change of 10 is very little. 问题是5000的高价10的变化很小。

You also have to take in account that you gave a continuously rising price, there it is impossible to get a value under 1. 您还必须考虑到您给出的价格不断上涨,不可能得到低于1的值。

In this calculation I use the mean() for the running observation of the last minutes. 在此计算中,我将mean()用于最后几分钟的运行观察。 I'm not sure but I speculate that on stock markets you use both min() and max() as reference in different time interval. 我不确定,但我推测在股票市场上,您会同时使用min()max()作为不同时间间隔的参考。 You choose min() as reference when your price is rising and max() when your price is falling. 当价格上涨时,您选择min()作为参考;当价格下跌时,您选择max() All this is possible in R. 这一切在R中都是可能的。

Here's an easy way to do this without having to rely on the data.table package. 这是一种简单的方法,无需依赖于data.table包。 If you want this for only 60 minute intervals, you would first need to filter btc_data for the relevant 60 minute intervals. 如果只希望间隔60分钟,则首先需要过滤btc_data间隔60分钟。

# make sure time is a date that can be sorted properly
btc_data$time = as.POSIXct(btc_data$time)

# sort data frame
btc_data = btc_data[order(btc_data$time),]

# calculate percentage change for 1 minute lag
btc_data$perc_change = NA
btc_data$perc_change[2:nrow(btc_data)] = (btc_data$btc_price[2:nrow(btc_data)] - btc_data$btc_price[1:(nrow(btc_data)-1)])/btc_data$btc_price[1:(nrow(btc_data)-1)]

# create category column
# NOTE: first category entry will be NA
btc_data$category = ifelse(btc_data$perc_change > 0.20, 1, ifelse(btc_data$perc_change < -0.20, -1, 0)) 

Using the data.table package and converting btc_data to a data.table would be a much more efficient and faster way to do this. 使用data.table包并将btc_data转换为data.table将是一种更高效,更快捷的方法。 There is a learning curve to using the package, but there are great vignettes and tutorials for this package. 使用该程序包需要学习,但是此程序包有不错的小插曲和教程。

I can't completely reproduce your example, but if I had to guess you would want to do something like this: 我无法完全重现您的示例,但是如果我不得不猜测您会想要执行以下操作:

btc_data$btc_price <- as.character(btc_data$btc_price)
btc_data$btc_price <- as.data.frame(as.numeric(gsub(",", "", 
btc_data$btc_price)))


pct_change <- NULL
for (i in 61:nrow(btc_data$btc_price)){
pct_change[i] <- (btc_data$btc_price[i,] - btc_data$btc_price[i - 60,]) / 
btc_data$btc_price[i - 60,]
}


pct_change <- pct_change[61:length(pct_change)]


new_category <- cut(pct_change, breaks = c(min(pct_change), -.2, .2, 
max(pct_change)), labels = c(-1,0,1))

btc_data.new <- btc_data[61 : nrow(btc_data),]
btc.data.new <- data.frame(btc_data.new, new_category)

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM