
Create a column representing the dataframe cut into 20 even groups in R

I am using R to try to create a column in my dataframe, called df, that splits the data into 20 equally sized groups, with the new column group holding the corresponding group for each row. An example of my ordered data looks like this:

                preds ground_truth
65378  0.000002975379            0
27082  0.000004721652            0
26890  0.000006613435            1
130498 0.000007634303            0
173319 0.000007834359            0
20039  0.000009482496            0
64722  0.000009482496            0
53924  0.000009482496            0
165543 0.000009482496            0

I have asked a similar question before and there are similar answers, but for some reason those solutions do not work here. The other answers are here:

Splitting a continuous variable into equal sized groups
R divide data into groups

My solution was to use cut like this:

  df$group <- cut(index(df), 20, labels = FALSE)

I expected this to cut the dataframe index into 20 even groups, so that over the 129844 rows there would be about 6492 in each group. However, this produces only a single group and does not split the data at all. Could someone explain why cut is not working here, when it has worked for other dataframes?

I would be happy to supply any extra information.

EDIT: I need the groups to be ordered with respect to preds, e.g. the first group will contain the highest 6492 values, the second the next highest 6492, and so on.

Here is a dput of the first 10 rows:

structure(list(preds = c(0.00000297537922317814, 0.00000472165221855588,
0.0000066134351160987, 0.00000763430272198875, 0.00000783435945631941,
0.00000948249581302744, 0.00000948249581314139, 0.00000948249581314247,
0.00000948249581314704, 0.0000094824958131879),
ground_truth = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
.Label = c("0", "1"), class = "factor")),
.Names = c("preds", "ground_truth"),
row.names = c("65378", "27082", "26890", "130498", "173319",
"20039", "64722", "53924", "165543", "168952"),
class = "data.frame")

How about just using some integer division?

If we had a data frame with 129844 rows:

df <- data.frame(a = runif(129844))

We can assign each row to one of 20 evenly sized groups, labelled 1 to 20, like this:

df$group <- factor(1 + (seq(nrow(df)) - 1) %/% (nrow(df) / 20))

And to prove it:

table(df$group)

#>    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20 
#> 6493 6492 6492 6492 6492 6493 6492 6492 6492 6492 6493 6492 6492 6492 6492 6493 6492 6492 6492 6492

Obviously 129844 is not evenly divisible by 20 (129844 = 20 × 6492 + 4), so we have 4 groups that contain 6493 members.
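The snippet above groups rows by their current position, so to get the ordering asked for in the EDIT (highest preds in group 1) the data would need to be sorted first. A minimal base R sketch of that, assuming a preds column as in the question; the random data, set.seed, n_groups and pos are placeholders for illustration, not part of the original post:

```r
# Hypothetical stand-in for the asker's data: 129844 random predictions.
set.seed(1)
df <- data.frame(preds = runif(129844))

n_groups <- 20  # assumed number of groups, as in the question

# Sort so that the highest predictions come first.
df <- df[order(df$preds, decreasing = TRUE), , drop = FALSE]

# Same integer-division idea as above, applied to the sorted positions.
pos <- seq_len(nrow(df))
df$group <- 1 + (pos - 1) %/% (nrow(df) / n_groups)

# Group sizes should differ by at most one row.
table(df$group)
```

Because the grouping depends only on row position, tied preds values can end up in different groups; whether that is acceptable depends on the use case.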

For equal-sized and ordered groups we can use ntile from the dplyr package:

library(dplyr)  # provides %>%, arrange(), mutate() and ntile()

df <- df %>%
  arrange(preds) %>%
  mutate(group = ntile(preds, 20))

              preds ground_truth group
65378  2.975379e-06            0     1
27082  4.721652e-06            0     2
26890  6.613435e-06            0     3
130498 7.634303e-06            0     4
173319 7.834359e-06            0     5
20039  9.482496e-06            0     6
64722  9.482496e-06            0     7
53924  9.482496e-06            0     8
165543 9.482496e-06            0     9
168952 9.482496e-06            0    10

As your sample consists of only 10 rows, there are just 10 groups here. It should work for your whole data frame. Or see cut_number from the ggplot2 package:

library(ggplot2)  # provides cut_number()
df$group2 <- cut_number(df$preds, 20, labels = 1:20)
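The two approaches can differ when preds contains ties. Quantile-based binning (the idea behind cut_number) keeps equal values in the same bin, so group sizes may be uneven, while rank-based binning (the idea behind ntile) splits ties by position to keep sizes equal. A rough base R sketch of both ideas follows. It is not the packages' actual implementation, and the simulated data and the names by_quantile and by_rank are assumptions made for illustration:

```r
# Hypothetical data: rounding to 2 decimals creates many tied values.
set.seed(42)
preds <- round(runif(1000), 2)

# Quantile-based bins: break points are the 5%, 10%, ... quantiles,
# so tied values always land in the same bin.
breaks <- quantile(preds, probs = seq(0, 1, length.out = 21))
by_quantile <- cut(preds, breaks = breaks, include.lowest = TRUE,
                   labels = FALSE)

# Rank-based bins: ties are broken by order of appearance,
# so every bin receives the same number of rows.
r <- rank(preds, ties.method = "first")
by_rank <- ceiling(r * 20 / length(preds))

# Compare the group sizes produced by the two methods.
table(by_quantile)
table(by_rank)
```

If strictly equal group sizes matter more than keeping tied values together, the rank-based variant is probably the safer choice.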
