
R - How to count the number of countries of origin per team id?

I have a dataset that looks like below (figure1):

[Figure 1: image not included]

Now I want to add a new column named "team_diversity" that shows how many different countries are represented within each team, so the new data frame should look like the one below (figure2):

[Figure 2: image not included]

Note: NAs in the country column should not be counted. For example, if one team consists of students from USA, CHINA, NA, JAPAN, then the team_diversity column should show 3 rather than 4.

Could anyone help me with some sample code? A detailed explanation would be a big plus. Thank you in advance!

Using data.table:

library(data.table)

dt = as.data.table(df)
# na.rm = TRUE keeps NA from being counted as a country
dt[, team_diversity := uniqueN(country, na.rm = TRUE), by = "team_id"]

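uniqueN() counts the unique values in the vector or column passed to it, and by = "team_id" evaluates it separately for each team_id group, so every row receives the count for its own team. Note that uniqueN() counts NA as a value unless na.rm = TRUE is passed.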
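As a minimal sketch of the NA handling the question asks about, assuming a data.table version whose uniqueN() supports the na.rm argument. The toy data and the column name with_na are made up for illustration:

```r
library(data.table)

# Hypothetical toy data: team 1 has one student with a missing country
dt <- data.table(
  team_id = c(1, 1, 1, 1, 2, 2),
  country = c("USA", "CHINA", NA, "JAPAN", "KOREA", "KOREA")
)

# Default behaviour: NA is counted as its own value
dt[, with_na := uniqueN(country), by = team_id]

# na.rm = TRUE drops missing values before counting
dt[, team_diversity := uniqueN(country, na.rm = TRUE), by = team_id]

print(dt)
```

For team 1 the with_na column holds 4 while team_diversity holds 3, which is the behaviour requested in the question.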

Does this answer your question?

> df %>% group_by(`team id`) %>% mutate(`team diversity` = length(unique(na.omit(country))))
# A tibble: 12 x 4
# Groups:   team id [3]
   `team id` `student id` country   `team diversity`
       <dbl>        <dbl> <chr>                <int>
 1         1           11 USA                      3
 2         1           12 CHINA                    3
 3         1           13 Japan                    3
 4         1           14 USA                      3
 5         2           21 KOREA                    2
 6         2           22 KOREA                    2
 7         2           23 AUSTRALIA                2
 8         3           31 USA                      4
 9         3           32 BRAZIL                   4
10         3           33 USA                      4
11         3           34 JAPAN                    4
12         3           35 CHINA                    4
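To unpack the expression length(unique(na.omit(country))) step by step, here is a small base R sketch using the example team from the question. The intermediate variable names are only for illustration:

```r
# The example team from the question: three countries plus one missing value
country <- c("USA", "CHINA", NA, "JAPAN")

no_na    <- na.omit(country)   # drops the NA, leaving 3 values
distinct <- unique(no_na)      # removes duplicates (none in this team)
team_diversity <- length(distinct)

print(team_diversity)

# Without na.omit(), NA is treated as one more distinct value
print(length(unique(country)))
```

Inside mutate() after group_by(), the same expression is evaluated once per team, so each row gets the count for its own team.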

Data used:

> dput(df)
structure(list(`team id` = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 
3), `student id` = c(11, 12, 13, 14, 21, 22, 23, 31, 32, 33, 
34, 35), country = c("USA", "CHINA", "Japan", "USA", "KOREA", 
"KOREA", "AUSTRALIA", "USA", "BRAZIL", "USA", "JAPAN", "CHINA"
)), row.names = c(NA, -12L), class = c("tbl_df", "tbl", "data.frame"
))

Maybe this approach can be useful:

library(dplyr)

# Data
df <- data.frame(team_id = c(rep(1, 4), rep(2, 3), rep(3, 5)),
                 student_id = c(11:14, 21:23, 31:35),
                 country = c('USA', 'CHINA', 'Japan', 'USA', 'KOREA', 'KOREA', 'AUSTRALIA',
                             'USA', 'BRAZIL', 'USA', 'JAPAN', 'CHINA'),
                 stringsAsFactors = FALSE)

# Code: count the distinct non-NA countries within each team
df %>% group_by(team_id) %>% mutate(team_diversity = n_distinct(country[!is.na(country)]))

Output:

# A tibble: 12 x 4
# Groups:   team_id [3]
   team_id student_id country   team_diversity
     <dbl>      <int> <chr>              <int>
 1       1         11 USA                    3
 2       1         12 CHINA                  3
 3       1         13 Japan                  3
 4       1         14 USA                    3
 5       2         21 KOREA                  2
 6       2         22 KOREA                  2
 7       2         23 AUSTRALIA              2
 8       3         31 USA                    4
 9       3         32 BRAZIL                 4
10       3         33 USA                    4
11       3         34 JAPAN                  4
12       3         35 CHINA                  4

You can use the dplyr package. I think this is the simplest way to achieve what you are looking for.

library(dplyr)

data <- read.csv(file.choose())

data %>%
  group_by(team_id) %>%  # group rows by team_id
  mutate(team_diversity = n_distinct(country, na.rm = TRUE))  # distinct non-NA countries within each team_id
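If one row per team is enough to check the counts, summarise() can be used in place of mutate(). This is only a sketch, assuming dplyr's n_distinct() with its na.rm argument. The toy data and the names toy, per_team and with_na are made up for illustration:

```r
library(dplyr)

# Hypothetical toy data: team 1 includes one missing country
toy <- data.frame(
  team_id = c(1, 1, 1, 1, 2, 2, 2),
  country = c("USA", "CHINA", NA, "JAPAN", "KOREA", "KOREA", "AUSTRALIA"),
  stringsAsFactors = FALSE
)

per_team <- toy %>%
  group_by(team_id) %>%
  summarise(
    with_na        = n_distinct(country),               # NA counted as a value
    team_diversity = n_distinct(country, na.rm = TRUE)  # NA ignored
  )

print(per_team)
```

Team 1 gives 4 with the default and 3 with na.rm = TRUE, so the na.rm argument matters whenever the country column can contain missing values.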


 