How to find the min and max number of intervening items between multiple occurrences of values in a data frame column
I have a factor vector called Categories with 47 levels:
Categories = as.factor(sort(make.unique(rep(letters, length.out = 47), sep='')))
[1] a a1 b b1 c c1 d d1 e e1 f f1 g g1 h h1 i i1 j j1 k k1 l l1 m m1 n n1 o o1 p p1 q q1 r r1 s s1 t
[40] t1 u u1 v w x y z
47 Levels: a a1 b b1 c c1 d d1 e e1 f f1 g g1 h h1 i i1 j j1 k k1 l l1 m m1 n n1 o o1 p p1 q q1 r r1 s s1 t t1 u u1 ... z
I have another vector called cat that holds 9 of those levels:
cat = Categories[c(7,42,43,24,45,26,35,6,15)]
[1] d u1 v l1 x m1 r c1 h
47 Levels: a a1 b b1 c c1 d d1 e e1 f f1 g g1 h h1 i i1 j j1 k k1 l l1 m m1 n n1 o o1 p p1 q q1 r r1 s s1 t t1 u u1 ... z
I also have a data frame My_Data with 36 rows. One of its columns contains multiple occurrences of the values from cat:
My_Data = data.frame(name = make.unique(rep(c(1:10,LETTERS), length.out = 36), sep=''), cat = sample(rep(cat,4),36,replace = FALSE), position = 0)
name cat position
1 1 v 0
2 2 r 0
3 3 h 0
4 4 m1 0
5 5 h 0
6 6 u1 0
7 7 l1 0
8 8 h 0
9 9 u1 0
10 10 x 0
11 A x 0
12 B v 0
13 C d 0
14 D c1 0
15 E r 0
16 F v 0
17 G l1 0
18 H d 0
19 I l1 0
20 J c1 0
21 K u1 0
22 L x 0
23 M v 0
24 N d 0
25 O l1 0
26 P m1 0
27 Q r 0
28 R m1 0
29 S h 0
30 T m1 0
31 U c1 0
32 V d 0
33 W r 0
34 X x 0
35 Y c1 0
36 Z u1 0
Using the code below, I can populate the position column so that each row shows how many times its cat value has occurred so far:
transform(My_Data, position = ave(as.character(cat), cat, FUN = seq_along))
The first 15 rows of the data frame My_Data would then look like this:
name cat position
1 1 v 1
2 2 r 1
3 3 h 1
4 4 m1 1
5 5 h 2
6 6 u1 1
7 7 l1 1
8 8 h 3
9 9 u1 2
10 10 x 1
11 A x 2
12 B v 2
13 C d 1
14 D c1 1
15 E r 2
Now I want to calculate the min. and max. number of intervening items between any 2 consecutive occurrences of the same value in the cat column.
How can I do this?
If I understand your question, here's one option:
library(tidyverse)
# Data
Categories = as.factor(sort(make.unique(rep(letters, length.out = 47), sep='')))
cat = Categories[c(7,42,43,24,45,26,35,6,15)]
# Set a seed for reproducibility
set.seed(5)
My_Data = data.frame(name = make.unique(rep(c(1:10,LETTERS), length.out = 36), sep=''),
                     cat = sample(rep(cat,4),36,replace = FALSE),
                     position = 0)
The code below numbers the rows, groups by cat, and summarises to give the minimum and maximum number of intervening rows for each level of cat. The gap between two consecutive occurrences is the difference of their row numbers minus 1.
# Summarise to give min and max number rows between each occurrence
My_Data %>%
  mutate(row=1:n()) %>%
  group_by(cat) %>%
  summarise(min.diff=min(diff(row)-1, na.rm=TRUE),
            max.diff=max(diff(row)-1, na.rm=TRUE))
     cat min.diff max.diff
  <fctr>    <dbl>    <dbl>
1     c1        4        6
2      d        1       16
3      h        1       16
4     l1        0       13
5     m1        0       12
6      r        5       15
7     u1        2        7
8      v        1       16
9      x        6       12
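For readers who prefer to avoid a package dependency, the same idea can be sketched in base R. This is only an illustration on a small made-up data frame (`toy` is hypothetical, not the My_Data from the question): split the row indices by level, take `diff()` of each group's indices, and subtract 1.

```r
# Toy data (hypothetical, much smaller than My_Data)
toy <- data.frame(cat = c("a", "b", "a", "c", "b", "a", "c"),
                  stringsAsFactors = TRUE)

# Row indices at which each level occurs
idx <- split(seq_len(nrow(toy)), toy$cat)

# Number of intervening rows between consecutive occurrences of each level
gaps <- lapply(idx, function(i) diff(i) - 1)

# One row per level with the smallest and largest gap
res <- data.frame(cat      = names(gaps),
                  min.diff = sapply(gaps, min),
                  max.diff = sapply(gaps, max),
                  row.names = NULL)
res
```

Here "a" occurs at rows 1, 3 and 6, so its gaps are 1 and 2; "b" and "c" each have a single gap of 2.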
If you want to mark the number of intervening rows in the original data frame, the code below adds a column giving the number of intervening rows since the previous occurrence of the same level of cat. The first occurrence of each level has no previous occurrence, so it gets NA.
# Add column with intervening number of rows between each occurrence in cat
My_Data %>%
  mutate(row=1:n()) %>%
  group_by(cat) %>%
  mutate(diff=c(NA,diff(row)-1)) %>%
  select(-row)
     name    cat position  diff
   <fctr> <fctr>    <dbl> <dbl>
 1      1     c1        0    NA
 2      2     m1        0    NA
 3      3      x        0    NA
 4      4      d        0    NA
 5      5     l1        0    NA
 6      6     l1        0     0
 7      7      r        0    NA
 8      8     c1        0     6
 9      9      h        0    NA
10     10      v        0    NA
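A per-row gap column like this can also be sketched in base R with `ave()`, the same function the question uses for the position column. The `toy` data frame below is made up for illustration and is not the My_Data from the question.

```r
# Toy data (hypothetical)
toy <- data.frame(cat = c("a", "b", "a", "c", "b", "a", "c"))

# For each level, take the row indices where it occurs and compute the gap
# to the previous occurrence; the first occurrence has no predecessor (NA)
toy$diff <- ave(seq_len(nrow(toy)), toy$cat,
                FUN = function(i) c(NA, diff(i) - 1))
toy
```

In this toy data the second "a" (row 3) gets a gap of 1 because only row 2 sits between it and the first "a" in row 1.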
Here is a tidy solution using lag():
library(tidyverse)
# create data frame
set.seed(1)
Categories <- as.factor(sort(make.unique(rep(letters, length.out = 47), sep='')))
cat <- Categories[c(7,42,43,24,45,26,35,6,15)]
My_Data <- data.frame(
  name = make.unique(rep(c(1:10,LETTERS), length.out = 36), sep=''),
  cat = sample(rep(cat,4),36,replace = FALSE),
  position = 0
)
# solution
My_Data %>%
  mutate(row = 1:n()) %>%
  group_by(cat) %>%
  # gap to the previous occurrence of the same level (NA for the first one)
  mutate(inter = row - lag(row) - 1) %>%
  summarize(min_inter = min(inter, na.rm = TRUE), max_inter = max(inter, na.rm = TRUE))
Result:
# A tibble: 9 x 3
cat min_inter max_inter
<fctr> <dbl> <dbl>
1 c1 0 10
2 d 4 11
3 h 0 8
4 l1 0 6
5 m1 1 3
6 r 0 16
7 u1 2 5
8 v 1 23
9 x 6 15
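One caveat that may matter for other data: in this example every level occurs four times, but a level that occurs only once has no gaps at all, and in base R `min()` and `max()` of an empty (or all-NA with `na.rm = TRUE`) vector return `Inf` and `-Inf` with a warning. A minimal sketch of a guard, where `safe_min` and `safe_max` are hypothetical helper names and not part of any package:

```r
# Hypothetical helpers: return NA instead of Inf / -Inf when there are no gaps
safe_min <- function(x) if (all(is.na(x))) NA_real_ else min(x, na.rm = TRUE)
safe_max <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)

once  <- c(NA)        # gaps for a level seen only once
twice <- c(NA, 3, 1)  # gaps for a level seen three times

safe_min(once)   # NA rather than Inf
safe_min(twice)  # smallest gap
safe_max(twice)  # largest gap
```

These could be used in place of `min(inter, na.rm = TRUE)` and `max(inter, na.rm = TRUE)` in the summarize step if single-occurrence levels are possible.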