Distributing value in a column to values in two other columns based on certain filtering
I am working on a programming puzzle that sounds straightforward but turns out to be hard to do efficiently in R without a for loop over a column of a data frame with 100k+ rows. I have tried dplyr (particularly group_by and mutate), data.table, and the apply family, but it has been tough. Could anyone help?
The problem is as follows. We are given a data frame df with columns key (character) and x, y, and z (numeric). Some values in key are repeated. Among the rows that share the same key, check row by row whether the value in x is smaller than the sum of the corresponding values in y. If it is, set that value in x to 0 and distribute it across the values in y according to the ordering of the corresponding values in z. How can we do this efficiently, given that we need to go through every distinct value in key?
Input
# 30 rows; the printed table below shows rows 1-18 and 25-30
df <- data.frame(
  key = c("aa_bb_1", "aa_bb_0", "ab_ca_0", "abc_bbb_1", "abbbc_aa_1", "aaa_ccc_1", "aa_bb_1", "aa_bb_1", "ab_ca_0", "abc_bbb_1", "abbbc_aa_1", "aaa_ccc_1", "aa_bb_0", "aa_bb_1", "ab_ca_0", "abc_bbb_0", "abbbc_aa_0", "aaa_ccc_1", "aa_bb_0", "aa_bb_1", "ab_ca_1", "abc_bbb_1", "abbbc_aa_1", "aaa_ccc_1", "aa_bb_1", "aa_bb_0", "ab_ca_0", "abc_bbb_1", "abbbc_aa_1", "aaa_ccc_1"),
  x = c(10, 19, 30, 25, 37, 13, 30, 40, 100, 53, 11, 27, 89, 21, 30, 30, 17, 9, 5, 57, 10, 19, 30, 25, 37, 13, 30, 40, 100, 53),
  y = c(3, 10, 18, 15, 32, 4, 6, 29, 71, 92, 11, 7, 21, 19, 13, 26, 28, 11, 8, 8, 5, 23, 3, 12, 19, 7, 9, 11, 7, 12),
  z = c(8, 13, 15, 16, 10, 10, 25, 21, 32, 15, 45, 8, 10, 50, 12, 10, 0, 0, 10, 12, 2, 40, 9, 8, 13, 15, 16, 10, 10, 25)
)
key x y z
1 aa_bb_1 10 3 8
2 aa_bb_0 19 10 13
3 ab_ca_0 30 18 15
4 abc_bbb_1 25 15 16
5 abbbc_aa_1 37 32 10
6 aaa_ccc_1 13 4 10
7 aa_bb_1 30 6 25
8 aa_bb_1 40 29 21
9 ab_ca_0 100 71 32
10 abc_bbb_1 53 92 15
11 abbbc_aa_1 11 11 45
12 aaa_ccc_1 27 7 8
13 aa_bb_0 89 21 10
14 aa_bb_1 21 19 50
15 ab_ca_0 30 13 12
16 abc_bbb_0 30 26 10
17 abbbc_aa_0 17 28 0
18 aaa_ccc_1 9 11 0
....
25 aa_bb_1 37 19 13
26 aa_bb_0 13 7 15
27 ab_ca_0 30 9 16
28 abc_bbb_1 40 11 10
29 abbbc_aa_1 100 7 10
30 aaa_ccc_1 53 12 25
I'm not sure exactly what your expected output looks like.
With dplyr you can do something like this. I'm pretty sure it doesn't exactly solve your problem, because your description is ambiguous, but you can use it as a template.
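library(dplyr)

df |>
  group_by(key) |>
  # zero x where the key is repeated and x is below the group's sum of y
  mutate(x = ifelse(n() > 1 & (x < sum(y)), 0, x)) |>
  ungroup() |>
  # replace y with the original (un-zeroed) x, reordered by z within each key
  mutate(y = df |> group_by(key) |> mutate(y = x[order(z)]) |> pull(y))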
key x y z
<chr> <dbl> <int> <int>
1 aa_bb_1 0 10 8
2 aa_bb_0 0 89 13
3 ab_ca_0 0 30 15
4 abc_bbb_1 0 53 16
5 abbbc_aa_1 0 37 10
6 aaa_ccc_1 0 9 10
7 aa_bb_1 0 40 25
8 aa_bb_1 0 30 21
9 ab_ca_0 0 30 32
10 abc_bbb_1 0 25 15
11 abbbc_aa_1 0 11 45
12 aaa_ccc_1 27 27 8
13 aa_bb_0 89 19 10
14 aa_bb_1 0 21 50
15 ab_ca_0 0 100 12
16 abc_bbb_0 30 30 10
17 abbbc_aa_0 17 17 0
18 aaa_ccc_1 0 13 0
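For comparison, here is a minimal base R sketch of the same template that avoids an explicit for loop by using ave(). It assumes the same reading of the question as the dplyr code above, so it is probably not the final answer either. The names toy, grp_sum, grp_n, idx, x_new, and y_new are made up for illustration, and toy holds only the aa_bb_1 and aa_bb_0 rows from the first 18 rows of the input.

```r
# Toy subset: rows 1, 2, 7, 8, 13, 14 of the question's data
toy <- data.frame(
  key = c("aa_bb_1", "aa_bb_0", "aa_bb_1", "aa_bb_1", "aa_bb_0", "aa_bb_1"),
  x   = c(10, 19, 30, 40, 89, 21),
  y   = c(3, 10, 6, 29, 21, 19),
  z   = c(8, 13, 25, 21, 10, 50)
)

# Per-key sum of y and per-key row count, recycled back to every row
grp_sum <- ave(toy$y, toy$key, FUN = sum)
grp_n   <- ave(toy$y, toy$key, FUN = length)

# Zero x where the key is repeated and x is below the group's sum of y
x_new <- ifelse(grp_n > 1 & toy$x < grp_sum, 0, toy$x)

# For each key, the row indices of that key sorted by z
idx <- ave(seq_len(nrow(toy)), toy$key,
           FUN = function(i) i[order(toy$z[i])])

# y becomes the original x, reordered by z within each key
y_new <- toy$x[idx]

print(data.frame(key = toy$key, x = x_new, y = y_new, z = toy$z))
```

On these six rows, x_new and y_new agree with the corresponding rows of the dplyr output above. Whether this is the "distribution" you actually want still depends on how the question is meant to be read.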