
Row Referencing in R data.table package

Suppose I have the following sample dataset:

library(data.table)
iris <- data.table(iris)[c(1:5,51:55,101:105), list(ID=.I, Species,Sepal.Length)]

Then say I want to compute the absolute difference between consecutive rows within each group (Species, in this case).

iris[ , SL.Diff := c(NA,abs(diff(Sepal.Length))) , by = Species]

At this point, I have a dataset that looks like this:

   ID    Species Sepal.Length SL.Diff
1:  1     setosa          5.1      NA
2:  2     setosa          4.9     0.2
3:  3     setosa          4.7     0.2
4:  4     setosa          4.6     0.1
5:  5     setosa          5.0     0.4
6:  6 versicolor          7.0      NA

Now I want to compute a new variable, Sepal.Length2, that takes the value from the next row whenever SL.Diff is below a threshold of 0.3.

iris[ , Sepal.Length2 := ifelse(SL.Diff < 0.3, iris[ID+1]$Sepal.Length, Sepal.Length)]

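This works the way I want. But what if I want to make the same comparison and, instead of taking the next row, take the value from the previous row?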

iris[ , Sepal.Length3 := ifelse(SL.Diff < 0.3, iris[ID-1]$Sepal.Length, Sepal.Length)]

Sepal.Length3 does not give the output I expect. Does anyone know what I am doing wrong here?

    ID    Species Sepal.Length SL.Diff Sepal.Length2 Sepal.Length3
 1:  1     setosa          5.1      NA            NA            NA
 2:  2     setosa          4.9     0.2           4.7           4.9
 3:  3     setosa          4.7     0.2           4.6           4.7
 4:  4     setosa          4.6     0.1           5.0           4.6
 5:  5     setosa          5.0     0.4           5.0           5.0
 6:  6 versicolor          7.0      NA            NA            NA
 7:  7 versicolor          6.4     0.6           6.4           6.4
 8:  8 versicolor          6.9     0.5           6.9           6.9
 9:  9 versicolor          5.5     1.4           5.5           5.5
10: 10 versicolor          6.5     1.0           6.5           6.5
11: 11  virginica          6.3      NA            NA            NA
12: 12  virginica          5.8     0.5           5.8           5.8
13: 13  virginica          7.1     1.3           7.1           7.1
14: 14  virginica          6.3     0.8           6.3           6.3
15: 15  virginica          6.5     0.2            NA           5.1

I'm not sure about the speed implications of this, but here is another attempt:

# make a column of the previous values (a lag of one row) using head()
iris[, S3 := c(NA,head(Sepal.Length,-1)), by=Species]
# overwrite those values not meeting your criteria with the original values
iris[ !(SL.Diff < 0.3), S3 := Sepal.Length]

iris
#    ID    Species Sepal.Length SL.Diff  S3
# 1:  1     setosa          5.1      NA  NA
# 2:  2     setosa          4.9     0.2 5.1
# 3:  3     setosa          4.7     0.2 4.9
# 4:  4     setosa          4.6     0.1 4.7
# 5:  5     setosa          5.0     0.4 5.0
# 6:  6 versicolor          7.0      NA  NA
# 7:  7 versicolor          6.4     0.6 6.4
# 8:  8 versicolor          6.9     0.5 6.9
# 9:  9 versicolor          5.5     1.4 5.5
#10: 10 versicolor          6.5     1.0 6.5
#11: 11  virginica          6.3      NA  NA
#12: 12  virginica          5.8     0.5 5.8
#13: 13  virginica          7.1     1.3 7.1
#14: 14  virginica          6.3     0.8 6.3
#15: 15  virginica          6.5     0.2 6.3
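The same lag, and the corresponding lead, can also be sketched with shift(), assuming a data.table version that provides it. The names dt, prev and nxt are made up for this illustration, and datasets::iris is used so the sketch does not depend on the iris object that was overwritten above.

```r
library(data.table)

# Rebuild the 15-row sample from the built-in dataset
dt <- as.data.table(datasets::iris)[c(1:5, 51:55, 101:105), .(Species, Sepal.Length)]
dt[, ID := .I]
dt[, SL.Diff := c(NA, abs(diff(Sepal.Length))), by = Species]

# shift() pads with NA at the group edge, so the length never changes
dt[, prev := shift(Sepal.Length, 1L, type = "lag"),  by = Species]
dt[, nxt  := shift(Sepal.Length, 1L, type = "lead"), by = Species]

# Take the neighbouring value only where the difference is below the threshold;
# rows where SL.Diff is NA stay NA
dt[, S2 := ifelse(SL.Diff < 0.3, nxt,  Sepal.Length)]
dt[, S3 := ifelse(SL.Diff < 0.3, prev, Sepal.Length)]
```

Because the shift is done by Species, the last row of a group gets NA as its next value instead of borrowing from the following group.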

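[.data.table evaluates i and j within the scope of the data.table.

Therefore, in iris[ID+1]$Sepal.Length, ID is evaluated (a second time) within the scope of iris.

Your problem arises because ID-1 produces a 0 index for the first row, and R silently drops 0 indices: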

a <- c('a','b')
a[0:1]
# [1] "a"
a[1]
# [1] "a"
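To connect this to the Sepal.Length3 column in the question, here is a small base R sketch. The vector sl just retypes the 15 Sepal.Length values from the table, and the names sl, id, prev_bad and prev_ok are made up for the illustration.

```r
# Sepal.Length values of the 15 sample rows, in order
sl <- c(5.1, 4.9, 4.7, 4.6, 5.0, 7.0, 6.4, 6.9, 5.5, 6.5, 6.3, 5.8, 7.1, 6.3, 6.5)
id <- seq_along(sl)

# id - 1L is 0:14; the 0 is dropped, so nothing is actually shifted
prev_bad <- sl[id - 1L]
length(prev_bad)              # 14, one element short
identical(prev_bad, sl[-15])  # TRUE: just the first 14 values, unshifted

# ifelse() recycles the short vector to length 15, so position 15 reuses sl[1]
rep(prev_bad, length.out = 15)[15]  # 5.1

# Prepending an explicit NA keeps the length and really shifts by one row
prev_ok <- c(NA, head(sl, -1))
```

This is why Sepal.Length3 repeats each row's own value for rows 2 to 14 and shows 5.1 in row 15.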

So you need to handle the "known NA values" and the implied NA values more carefully.

Here is one approach:

# calculate the "threshold" column
iris[,thresh := SL.Diff <0.3]
# row index to read from when going "up": the next row if below the threshold, otherwise the row itself
iris[!is.na(thresh), up := ifelse(thresh, ID+1L,ID)]
# create the column
iris[, S2 := Sepal.Length[up]]
# the same for "down"

iris[!is.na(thresh), down := ifelse(thresh, ID-1L,ID)]
iris[, S3 := Sepal.Length[down]]

iris
#     ID    Species Sepal.Length SL.Diff thresh up  S2 down  S3
#  1:  1     setosa          5.1      NA     NA NA  NA   NA  NA
#  2:  2     setosa          4.9     0.2   TRUE  3 4.7    1 5.1
#  3:  3     setosa          4.7     0.2   TRUE  4 4.6    2 4.9
#  4:  4     setosa          4.6     0.1   TRUE  5 5.0    3 4.7
#  5:  5     setosa          5.0     0.4  FALSE  5 5.0    5 5.0
#  6:  6 versicolor          7.0      NA     NA NA  NA   NA  NA
#  7:  7 versicolor          6.4     0.6  FALSE  7 6.4    7 6.4
#  8:  8 versicolor          6.9     0.5  FALSE  8 6.9    8 6.9
#  9:  9 versicolor          5.5     1.4  FALSE  9 5.5    9 5.5
# 10: 10 versicolor          6.5     1.0  FALSE 10 6.5   10 6.5
# 11: 11  virginica          6.3      NA     NA NA  NA   NA  NA
# 12: 12  virginica          5.8     0.5  FALSE 12 5.8   12 5.8
# 13: 13  virginica          7.1     1.3  FALSE 13 7.1   13 7.1
# 14: 14  virginica          6.3     0.8  FALSE 14 6.3   14 6.3
# 15: 15  virginica          6.5     0.2   TRUE 16  NA   14 6.3
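If you would rather not rely on where the NA entries happen to fall, one option is to map every out-of-range index to NA explicitly before subsetting. The helper below, safe_pick, is a hypothetical name for a minimal base R sketch and is not part of data.table. It does not by itself respect group boundaries.

```r
# Hypothetical helper: return x[idx], with NA for any index outside 1..length(x)
safe_pick <- function(x, idx) {
  idx <- as.integer(idx)
  out_of_range <- which(idx < 1L | idx > length(x))
  idx[out_of_range] <- NA_integer_
  x[idx]
}

x <- c(10, 20, 30)
safe_pick(x, 0:2)  # NA 10 20  (length preserved, unlike x[0:2])
safe_pick(x, 2:4)  # 20 30 NA
```

The result always has the same length as idx, so it can be assigned to a column without any silent recycling.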

I think dplyr makes this a bit easier to express by providing the lead() and lag() functions:

library(dplyr)
iris2 <- iris[c(1:5, 51:55, 101:105), c("Species", "Sepal.Length")]
names(iris2) <- c("species", "sepal")
iris2$id <- 1:15

iris2 %>%
  group_by(species) %>%
  mutate(
    thres = abs(sepal - lag(sepal)),
    up =   ifelse(thres < 0.3, lead(sepal), sepal),
    down = ifelse(thres < 0.3, lag(sepal), sepal)
  )
