New column which counts the number of times a value in a specific row of one column appears in another column
I have tried searching for an answer to this question, but it continues to elude me. I am working with crime data where each row refers to a specific crime incident. There is a variable for suspect ID and a variable for victim ID. These ID numbers are consistent across the two columns (in other words, if a row contains the ID 424 in the victim ID column, and a separate row contains the ID 424 in the suspect column, I know that the same person was listed as a victim in the first crime and as a suspect in the second crime).
I want to create two new variables: one which counts the number of times the victim (in a particular crime incident) has been recorded as a suspect (in the dataset as a whole), and one which counts the number of times the suspect (in a particular crime incident) has been recorded as a victim (in the dataset as a whole).
Here's a simplified version of my data:
| | s.uid | v.uid |
|---|---|---|
| 1 | 1 | 9 |
| 2 | 2 | 8 |
| 3 | 3 | 2 |
| 4 | 4 | 2 |
| 5 | 5 | 2 |
| 6 | NA | 7 |
| 7 | 5 | 6 |
| 8 | 9 | 5 |
And here is what I want to create:
| | s.uid | v.uid | s.in.v | v.in.s |
|---|---|---|---|---|
| 1 | 1 | 9 | 0 | 1 |
| 2 | 2 | 8 | 3 | 0 |
| 3 | 3 | 2 | 0 | 1 |
| 4 | 4 | 2 | 0 | 1 |
| 5 | 5 | 2 | 1 | 1 |
| 6 | NA | 7 | NA | 0 |
| 7 | 5 | 6 | 1 | 0 |
| 8 | 9 | 5 | 1 | 2 |
Note that, where there is an NA, I would like the NA to be preserved. I'm currently trying to work in tidyverse and piping where possible, so I would prefer answers in that kind of format, but I'm open to any solution!
Using dplyr:
dat %>%
  group_by(s.uid) %>%
  mutate(s.in.v = sum(dat$v.uid %in% s.uid)) %>%
  group_by(v.uid) %>%
  mutate(v.in.s = sum(dat$s.uid %in% v.uid))
# A tibble: 8 × 4
# Groups:   v.uid [6]
  s.uid v.uid s.in.v v.in.s
  <int> <int>  <int>  <int>
1     1     9      0      1
2     2     8      3      0
3     3     2      0      1
4     4     2      0      1
5     5     2      1      1
6    NA     7      0      0
7     5     6      1      0
8     9     5      1      2
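Note that the output above gives 0 rather than NA for `s.in.v` in row 6. If the NA should be preserved as the question asks, one possible variant is sketched below. It is only a sketch: `count_in` is a hypothetical helper name (not part of dplyr), and `dat` is rebuilt here from the example data in the question so the block runs on its own.

```r
library(dplyr)

# Example data from the question, rebuilt so the block is self-contained.
dat <- tibble(
  s.uid = c(1, 2, 3, 4, 5, NA, 5, 9),
  v.uid = c(9, 8, 2, 2, 2, 7, 6, 5)
)

# Hypothetical helper: for each element of x, count how often it occurs in y.
# Elements of x that are NA get NA instead of a count.
count_in <- function(x, y) {
  counts <- vapply(x, function(id) sum(y %in% id), integer(1))
  counts[is.na(x)] <- NA_integer_
  counts
}

res <- dat %>%
  mutate(
    s.in.v = count_in(s.uid, v.uid),  # suspect recorded as a victim
    v.in.s = count_in(v.uid, s.uid)   # victim recorded as a suspect
  )

res
```

Because no grouping is involved, the result is an ungrouped tibble, so no `ungroup()` is needed before further steps.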
First, a reprex of your data:
library(tidyverse)
# Replica of your data:
s.uid <- c(1:5, NA, 5, 9)
v.uid <- c(9, 8, 2, 2, 2, 7, 6, 5)
DF <- tibble(s.uid, v.uid)
Custom function to use:
# function to check how many times "a" (a length 1 atomic vector) occurs in "b":
f <- function(a, b) {
  a <- as.character(a)
  # make a lookup table, a.k.a. a dictionary, of values in b:
  b_freq <- table(b, useNA = "always")
  # if a is in b, return its frequency:
  if (a %in% names(b_freq)) {
    return(b_freq[a])
  }
  # else (i.e. a is not in b) return 0:
  return(0)
}
# vectorise that, enabling intake of any length of "a":
ff <- function(a, b) {
  purrr::map_dbl(.x = a, .f = f, b = b)
}
Finally:
DF |>
  mutate(
    s_in_v = ff(s.uid, v.uid),
    v_in_s = ff(v.uid, s.uid)
  )
Results in:
#> # A tibble: 8 × 4
#>   s.uid v.uid s_in_v v_in_s
#>   <dbl> <dbl>  <dbl>  <dbl>
#> 1     1     9      0      1
#> 2     2     8      3      0
#> 3     3     2      0      1
#> 4     4     2      0      1
#> 5     5     2      1      1
#> 6    NA     7     NA      0
#> 7     5     6      1      0
#> 8     9     5      1      2
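For larger datasets, the same counts could also be obtained with joins instead of a per-element lookup. The following is a rough sketch under that assumption, not a benchmarked recommendation; the intermediate names `victim_counts`, `suspect_counts` and the key column `id` are made up for illustration.

```r
library(dplyr)

# Same example data as in the reprex.
DF <- tibble(
  s.uid = c(1:5, NA, 5, 9),
  v.uid = c(9, 8, 2, 2, 2, 7, 6, 5)
)

# How often each ID appears as a victim / as a suspect (NA IDs dropped).
victim_counts <- DF %>%
  filter(!is.na(v.uid)) %>%
  count(id = v.uid, name = "s_in_v")

suspect_counts <- DF %>%
  filter(!is.na(s.uid)) %>%
  count(id = s.uid, name = "v_in_s")

res <- DF %>%
  left_join(victim_counts, by = c("s.uid" = "id")) %>%
  left_join(suspect_counts, by = c("v.uid" = "id")) %>%
  mutate(
    # IDs with no match get 0; rows whose own ID is NA keep NA.
    s_in_v = if_else(is.na(s.uid), NA_integer_, coalesce(s_in_v, 0L)),
    v_in_s = if_else(is.na(v.uid), NA_integer_, coalesce(v_in_s, 0L))
  )

res
```

Each lookup table has one row per ID, so the left joins keep the original row count and order.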