I have a dataset that looks like the one below (figure 1):
Now I want to add a new column named "team_diversity" that shows how many different countries are represented within each team, so the new data frame should look like the one below (figure 2):
Note: I don't want to count NAs in the country column. E.g., if one team consists of students from USA, CHINA, NA, JAPAN, then the team_diversity column should show 3 rather than 4.
Could anyone help me with some sample code? A detailed explanation would be a big plus. Thank you in advance!
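Using data.table:
library(data.table)
dt <- as.data.table(df)
dt[, team_diversity := uniqueN(country, na.rm = TRUE), by = team_id]
uniqueN() counts the unique values in the vector or column passed to it, and by = team_id groups the rows by team_id, so the values of country are passed to uniqueN() one team at a time. na.rm = TRUE drops NA before counting, so a missing country is not counted as a country of its own.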
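To check how uniqueN() treats missing values, here is a minimal sketch on a made-up team (USA, CHINA, NA, JAPAN, taken from the example in the question, not from the real data). The column name with_na is only for illustration.

```r
library(data.table)

# Hypothetical single team with one missing country
dt <- data.table(
  team_id = c(1, 1, 1, 1),
  country = c("USA", "CHINA", NA, "JAPAN")
)

# Default behaviour: NA is counted as a distinct value
dt[, with_na := uniqueN(country), by = team_id]

# na.rm = TRUE drops NA before counting
dt[, team_diversity := uniqueN(country, na.rm = TRUE), by = team_id]

print(dt)
# with_na is 4 on every row, team_diversity is 3
```

If the NA-excluding count is what you want, the na.rm = TRUE form is the one to keep.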
Does this answer your question?
> df %>% group_by(`team id`) %>% mutate(`team diversity` = length(unique(na.omit(country))))
# A tibble: 12 x 4
# Groups:   team id [3]
   `team id` `student id` country   `team diversity`
       <dbl>        <dbl> <chr>                <int>
 1         1           11 USA                      3
 2         1           12 CHINA                    3
 3         1           13 Japan                    3
 4         1           14 USA                      3
 5         2           21 KOREA                    2
 6         2           22 KOREA                    2
 7         2           23 AUSTRALIA                2
 8         3           31 USA                      4
 9         3           32 BRAZIL                   4
10         3           33 USA                      4
11         3           34 JAPAN                    4
12         3           35 CHINA                    4
Data used:
> dput(df)
structure(list(`team id` = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3,
3), `student id` = c(11, 12, 13, 14, 21, 22, 23, 31, 32, 33,
34, 35), country = c("USA", "CHINA", "Japan", "USA", "KOREA",
"KOREA", "AUSTRALIA", "USA", "BRAZIL", "USA", "JAPAN", "CHINA"
)), row.names = c(NA, -12L), class = c("tbl_df", "tbl", "data.frame"
))
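The sample data has no missing countries, so the NA rule from the question is never exercised. A minimal base-R sketch of the same length(unique(na.omit(...))) idea, on a made-up data frame that does contain an NA (df_na and count_countries are illustrative names, not part of the original answer):

```r
# Hypothetical data: team 1 has one missing country
df_na <- data.frame(
  team_id = c(1, 1, 1, 1, 2, 2),
  country = c("USA", "CHINA", NA, "JAPAN", "KOREA", "KOREA"),
  stringsAsFactors = FALSE
)

# Count distinct non-missing values in a vector
count_countries <- function(x) length(unique(na.omit(x)))

# One count per team, named by team_id
counts <- tapply(df_na$country, df_na$team_id, count_countries)

# Look up each row's team count and attach it as a new column
df_na$team_diversity <- as.integer(counts[as.character(df_na$team_id)])

print(df_na)
# team 1 rows get 3 (NA is not counted), team 2 rows get 1
```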
Maybe this approach can be useful:
library(dplyr)
#Data
df <- data.frame(team_id = c(rep(1, 4), rep(2, 3), rep(3, 5)),
                 student_id = c(11:14, 21:23, 31:35),
                 country = c('USA', 'CHINA', 'Japan', 'USA', 'KOREA', 'KOREA', 'AUSTRILIA',
                             'USA', 'BRAZIL', 'USA', 'JAPAN', 'CHINA'),
                 stringsAsFactors = FALSE)
#Code
df %>% group_by(team_id) %>% mutate(team_diversity=n_distinct(country[!is.na(country)]))
Output:
# A tibble: 12 x 4
# Groups:   team_id [3]
   team_id student_id country   team_diversity
     <dbl>      <int> <chr>              <int>
 1       1         11 USA                    3
 2       1         12 CHINA                  3
 3       1         13 Japan                  3
 4       1         14 USA                    3
 5       2         21 KOREA                  2
 6       2         22 KOREA                  2
 7       2         23 AUSTRILIA              2
 8       3         31 USA                    4
 9       3         32 BRAZIL                 4
10       3         33 USA                    4
11       3         34 JAPAN                  4
12       3         35 CHINA                  4
You can use the dplyr package. I think this is the simplest way to achieve what you are looking for.
library(dplyr)
data <- read.csv(file.choose())  # pick the CSV file interactively
data %>%
  group_by(team_id) %>%  # group rows by team_id
  mutate(team_diversity = n_distinct(country, na.rm = TRUE))  # distinct non-NA countries within each team_id
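A quick way to see how n_distinct() handles NA is to run it with and without na.rm on a made-up team (USA, CHINA, NA, JAPAN, as in the question). This is only a sketch; teams, result and with_na are illustrative names.

```r
library(dplyr)

# Hypothetical single team with one missing country
teams <- tibble(
  team_id = c(1, 1, 1, 1),
  country = c("USA", "CHINA", NA, "JAPAN")
)

result <- teams %>%
  group_by(team_id) %>%
  mutate(
    with_na        = n_distinct(country),               # NA counts as a value
    team_diversity = n_distinct(country, na.rm = TRUE)  # NA is dropped first
  ) %>%
  ungroup()

print(result)
# with_na is 4 on every row, team_diversity is 3
```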