I have a data frame like the one below, with the names of marine fish species in one column and their respective BIN (a sort of ID for each species) in another column. Sometimes a single BIN can correspond to more than one species. I want to check which species correspond to a single BIN and which species have more than one BIN. I'm sorry if this is confusing, but I'm very lost as to how to do this. Thank you in advance for any suggestions.
species               BIN
Tilapia guineensis    BOLD:AAL5979
Tilapia zillii        BOLD:AAB9042
Fundulus rubrifrons   BOLD:AAI7245
Eutrigla gurnardus    BOLD:AAC0262
Sprattus sprattus     BOLD:AAE9187
Gadus morhua          BOLD:ACF1143
Clupea harengus       BOLD:AAB7944
(...)
With dplyr, you can do the following (I used sample data in which one species has two BINs):
library(dplyr)

df %>%
  group_by(species) %>%
  summarise(occurrence = n_distinct(BIN),                 # number of distinct BINs per species
            BIN = paste(unique(BIN), collapse = ","))     # those BINs joined into one string
  species             occurrence BIN
  <chr>                    <int> <chr>
1 Clupea_harengus              1 BOLD:AAB7944
2 Eutrigla_gurnardus           2 BOLD:AAC0262,BOLD:AAE9187
3 Fundulus_rubrifrons          1 BOLD:AAI7245
4 Gadus_morhua                 1 BOLD:ACF1143
5 Sprattus_sprattus            1 BOLD:AAE9187
6 Tilapia_guineensis           1 BOLD:AAL5979
7 Tilapia_zillii               1 BOLD:AAB9042
It counts the number of distinct BINs per species and joins the unique BINs belonging to each species into a single comma-separated string.
Sample data:
df <- read.table(text = "species BIN
2 Tilapia_guineensis BOLD:AAL5979
3 Tilapia_zillii BOLD:AAB9042
4 Fundulus_rubrifrons BOLD:AAI7245
5 Eutrigla_gurnardus BOLD:AAC0262
6 Eutrigla_gurnardus BOLD:AAE9187
7 Sprattus_sprattus BOLD:AAE9187
8 Gadus_morhua BOLD:ACF1143
9 Clupea_harengus BOLD:AAB7944", header = TRUE,
stringsAsFactors = FALSE)
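The question also mentions the reverse case, where one BIN corresponds to more than one species. A minimal sketch of that check, assuming the same `species` and `BIN` column names as the sample data, is to group by `BIN` instead. The names `bin_summary`, `shared_bins` and `n_species` are made up for this illustration.

```r
library(dplyr)

# Same sample data as above, built with data.frame() so the block runs on its own
df <- data.frame(
  species = c("Tilapia_guineensis", "Tilapia_zillii", "Fundulus_rubrifrons",
              "Eutrigla_gurnardus", "Eutrigla_gurnardus", "Sprattus_sprattus",
              "Gadus_morhua", "Clupea_harengus"),
  BIN = c("BOLD:AAL5979", "BOLD:AAB9042", "BOLD:AAI7245", "BOLD:AAC0262",
          "BOLD:AAE9187", "BOLD:AAE9187", "BOLD:ACF1143", "BOLD:AAB7944"),
  stringsAsFactors = FALSE
)

# One row per BIN: how many distinct species share it, and which ones
bin_summary <- df %>%
  distinct(species, BIN) %>%
  group_by(BIN) %>%
  summarise(n_species = n_distinct(species),
            species = paste(sort(unique(species)), collapse = ","),
            .groups = "drop")

# Keep only the BINs assigned to more than one species
shared_bins <- filter(bin_summary, n_species > 1)
shared_bins
```

With this sample data, only BOLD:AAE9187 should be kept, since it is listed for both Eutrigla_gurnardus and Sprattus_sprattus.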
Another option in the tidyverse is to get the distinct rows, group by 'species', and summarise 'occurrence' as the number of rows (n()). The BINs are then collapsed into a single string with str_c from stringr (part of the tidyverse packages), which behaves differently from paste when there is an NA element.
library(dplyr)
library(stringr)
df %>%
  distinct() %>%
  group_by(species) %>%
  summarise(occurrence = n(),
            BIN = str_c(unique(BIN), collapse = ","))
# A tibble: 7 x 3
# species occurrence BIN
# <chr> <int> <chr>
#1 Clupea_harengus 1 BOLD:AAB7944
#2 Eutrigla_gurnardus 2 BOLD:AAC0262,BOLD:AAE9187
#3 Fundulus_rubrifrons 1 BOLD:AAI7245
#4 Gadus_morhua 1 BOLD:ACF1143
#5 Sprattus_sprattus 1 BOLD:AAE9187
#6 Tilapia_guineensis 1 BOLD:AAL5979
#7 Tilapia_zillii 1 BOLD:AAB9042
If there are NA elements, the behavior is slightly different (unless we take care of the NAs first): paste turns NA into the string "NA", while str_c returns NA for the whole result.
paste(c(NA, 'a', 'b'), collapse=",")
#[1] "NA,a,b"
str_c(c(NA, 'a', 'b'), collapse=",")
#[1] NA
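One possible way to "take care of the NAs first" is to drop them before collapsing, so that both functions give the same string. This is only a sketch; the vector `x` and the result names are made up for illustration, and whether dropping missing BINs is appropriate depends on the data.

```r
library(stringr)

x <- c(NA, "a", "b")

# Remove missing values before collapsing
x_clean <- x[!is.na(x)]

collapsed_paste <- paste(x_clean, collapse = ",")
collapsed_str_c <- str_c(x_clean, collapse = ",")

collapsed_paste
#[1] "a,b"
collapsed_str_c
#[1] "a,b"
```

Inside summarise, the same idea would apply to the BIN column, for example by filtering out rows with a missing BIN before grouping.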
df <- structure(list(species = c("Tilapia_guineensis", "Tilapia_zillii",
"Fundulus_rubrifrons", "Eutrigla_gurnardus", "Eutrigla_gurnardus",
"Sprattus_sprattus", "Gadus_morhua", "Clupea_harengus"), BIN = c("BOLD:AAL5979",
"BOLD:AAB9042", "BOLD:AAI7245", "BOLD:AAC0262", "BOLD:AAE9187",
"BOLD:AAE9187", "BOLD:ACF1143", "BOLD:AAB7944")),
class = "data.frame", row.names = c("2",
"3", "4", "5", "6", "7", "8", "9"))