How to bin ordered data by percentile for each id in R dataframe [r]

I have a dataframe that contains 70-80 rows of ordered response time (RT) data for each of 228 people, each with a unique id (not everyone has the same number of rows). I want to bin each person's RTs into 5 bins. I want the 1st bin to be their fastest 20 percent of RTs, the 2nd bin to be their next fastest 20 percent, and so on. Each bin should have the same number of trials in it (unless the total number of trials doesn't divide evenly by 5).

My current dataframe looks like this:

id     RT
7000   225
7000   250
7000   253
7001   189
7001   201
7001   225

I'd like my new dataframe to look like this:

id   RT    Bin
7000  225    1
7000  250    1

After getting my data to look like this, I will aggregate by id and bin.

The only way I can think of to do this is to split the data into a list (using the split command), loop through each person, use the quantile command to get break points for the different bins, and assign a bin value (1-5) to every response time. This feels very convoluted (and would be difficult for me). I'm in a bit of a jam and I would greatly appreciate any help in how to streamline this process. Thanks.

The answer @Chase gave splits the range into 5 groups of equal length (difference of endpoints). What you seem to want is quintiles (5 groups with an equal number in each group). For that, you need the cut2 function in Hmisc:

library("plyr")
library("Hmisc")

dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))

tmp <- ddply(dat, "id", transform, hists = as.numeric(cut2(value, g = 5)))

tmp now has what you want:

> tmp
    id       value hists
1    1  0.19016791     3
2    1  0.27795226     4
3    1  0.74350982     5
4    1  0.43459571     4
5    1 -2.72263322     1
....
95  10 -0.10111905     3
96  10 -0.28251991     2
97  10 -0.19308950     2
98  10  0.32827137     4
99  10 -0.01993215     4
100 10 -1.04100991     1

Each id has the same number of rows in each hists bin:

> table(tmp$id, tmp$hists)

     1 2 3 4 5
  1  2 2 2 2 2
  2  2 2 2 2 2
  3  2 2 2 2 2
  4  2 2 2 2 2
  5  2 2 2 2 2
  6  2 2 2 2 2
  7  2 2 2 2 2
  8  2 2 2 2 2
  9  2 2 2 2 2
  10 2 2 2 2 2
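If you would rather not depend on Hmisc, the same equal-count idea can be sketched in base R by ranking RTs within each person and mapping the ranks onto bins 1 to 5. This is only an illustration, not the cut2 algorithm: the data frame rt_df, its values, and the helper bin_by_rank are made up here to mimic the question's layout, with unequal trial counts per id.

```r
# Hypothetical data in the question's layout; ids have unequal trial counts.
set.seed(1)
rt_df <- data.frame(
  id = rep(c(7000, 7001, 7002), times = c(70, 73, 80)),
  RT = round(runif(223, 150, 600))
)

# Rank RTs within a person, then map ranks 1..n onto bins 1..n_bins.
# ties.method = "first" keeps tied RTs from piling into a single bin.
bin_by_rank <- function(x, n_bins = 5) {
  ceiling(n_bins * rank(x, ties.method = "first") / length(x))
}

# ave() applies the helper separately to each id and keeps the row order.
rt_df$Bin <- ave(rt_df$RT, rt_df$id, FUN = bin_by_rank)

# Trials per bin for each person.
counts <- table(rt_df$id, rt_df$Bin)
counts
```

When a person's trial count is not a multiple of 5, bin sizes under this scheme differ by at most one trial. Tied RTs are split by row order, which may or may not be what you want.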

Here's a reproducible example using package plyr and the cut function:

dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))

ddply(dat, "id", transform, hists = cut(value, breaks = 5))

    id       value             hists
1    1 -1.82080027     (-1.94,-1.41]
2    1  0.11035796     (-0.36,0.166]
3    1 -0.57487134    (-0.886,-0.36]
4    1 -0.99455189    (-1.41,-0.886]
....
96  10 -0.03376074    (-0.233,0.386]
97  10 -0.71879488   (-0.853,-0.233]
98  10 -0.17533570    (-0.233,0.386]
99  10 -1.07668282    (-1.47,-0.853]
100 10 -1.45170078    (-1.47,-0.853]

Pass labels = FALSE to cut if you want simple integer values returned instead of the bin intervals.
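To make the effect of labels = FALSE concrete, and to show how equal-width bins differ from equal-count bins, here is a small sketch on a made-up vector with one extreme value. The vector x and the variable names are illustrative assumptions.

```r
x <- c(1, 2, 3, 4, 100)   # one extreme value stretches the range

# Equal-width bins: the range of x is split into 5 intervals of equal length,
# so the four small values all land in the first interval.
equal_width <- cut(x, breaks = 5, labels = FALSE)
equal_width

# Equal-count bins: use quantile cut points as the breaks instead.
# include.lowest = TRUE keeps the minimum value inside the first bin.
equal_count <- cut(x, breaks = quantile(x, seq(0, 1, 0.2)),
                   include.lowest = TRUE, labels = FALSE)
equal_count
```

With skewed data such as response times, equal-width bins can end up very uneven in size, which is why quantile-based breaks are closer to what the question asks for.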

Here's an answer in plain old R.

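# make up some data: 3 ids with 20 RTs each
df <- data.frame(rt = rnorm(60), id = rep(letters[1:3], each = 20))

# and this is all there is to it
# sort by id, then by rt within each id
df <- df[order(df$id, df$rt), ]
# quantile() returns 5 values per id; each one labels a block of
# 4 consecutive sorted RTs for that id
df$bin <- rep(unlist(tapply(df$rt, df$id, quantile)), each = 4)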

You'll note that the quantile command used can be set to use any quantiles. By default it returns five values (the minimum, the three quartiles, and the maximum), which is where the five bin labels come from. If you want deciles, then use

quantile(x, seq(0, 1, 0.1))

in the function above.
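As a quick check on how many values quantile returns for different probs arguments, and how a set of cut points can be turned into bin numbers, here is a sketch on a made-up sorted vector. The names x, edges, and bins are assumptions for illustration, and findInterval is swapped in here as one possible way to assign the bins.

```r
x <- 1:20  # hypothetical sorted RTs for one person

q_default <- quantile(x)                    # 5 values: 0%, 25%, 50%, 75%, 100%
q_fifths  <- quantile(x, seq(0, 1, 0.2))    # 6 values: edges of 5 bins
q_tenths  <- quantile(x, seq(0, 1, 0.1))    # 11 values: edges of 10 bins

# k bins need k + 1 edges; findInterval() maps each value to the
# interval it falls in, and rightmost.closed = TRUE puts the maximum
# into the last bin instead of a bin of its own.
edges <- q_fifths
bins <- findInterval(x, edges, rightmost.closed = TRUE)
table(bins)
```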

The answer above is a bit fragile. It requires equal numbers of RTs per id, and I didn't tell you how to get to the magic number 4 (it is the 20 RTs per id divided by 5 bins). But it will also run very fast on a large dataset. If you want a more robust solution, here is one using cut2 from Hmisc:

library('Hmisc')
df <- df[order(df$id),]
df$bin <- unlist(lapply( unique(df$id), function(x) cut2(df$rt[df$id==x], g = 5) ))

This is much more robust than the first solution but it isn't as fast. For small datasets you won't notice the difference.
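Once a bin column exists, the aggregation step mentioned in the question (aggregate by id and bin) might look like the sketch below. The data frame binned and its values are invented for illustration, and the mean is only one possible summary.

```r
# Hypothetical binned data: two people, ten sorted RTs each, two RTs per bin.
binned <- data.frame(
  id  = rep(c(7000, 7001), each = 10),
  RT  = c(seq(200, 380, by = 20), seq(190, 280, by = 10)),
  Bin = rep(rep(1:5, each = 2), times = 2)
)

# Mean RT per person per bin: one row for each id/Bin combination.
bin_means <- aggregate(RT ~ id + Bin, data = binned, FUN = mean)
bin_means <- bin_means[order(bin_means$id, bin_means$Bin), ]
bin_means
```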
