I have a table with a list of categories, each with a count value, that I'd like to collapse based on similarity. For example, Mariner-1_Amel and Mariner-10 would become a single category, Mariner, and anything with 'Jockey' or 'hAT' in the name should be collapsed as well.
I'm struggling to find a solution that can cope with all the possibilities. Is there an easy dplyr solution?
Reproducible with:
> dput(tibs)
structure(list(type = c("(TTAAG)n_1", "AMARI_1", "Copia-4_LH-I",
"DNA", "DNA-1_CQ", "DNA/hAT-Charlie", "DNA/hAT-Tip100", "DNA/MULE-MuDR",
"DNA/P", "DNA/PiggyBac", "DNA/TcMar-Mariner", "DNA/TcMar-Tc1",
"DNA/TcMar-Tigger", "G3_DM", "Gypsy-10_CFl-I", "hAT-1_DAn", "hAT-16_SM",
"hAT-N4_RPr", "HELITRON7_CB", "Jockey-1_DAn", "Jockey-1_DEl",
"Jockey-12_DF", "Jockey-5_DTa", "Jockey-6_DYa", "Jockey-6_Hmel",
"Jockey-7_HMM", "Jockey-8_Hmel", "LINE/Dong-R4", "LINE/I", "LINE/I-Jockey",
"LINE/I-Nimb", "LINE/Jockey", "LINE/L1", "LINE/L2", "LINE/R1",
"LINE/R2", "LINE/R2-NeSL", "LINE/Tad1", "LTR/Gypsy", "Mariner_CA",
"Mariner-1_AMel", "Mariner-10_HSal", "Mariner-13_ACe", "Mariner-15_HSal",
"Mariner-16_DAn", "Mariner-19_RPr", "Mariner-30_SM", "Mariner-39_SM",
"Mariner-42_HSal", "Mariner-46_HSal", "Mariner-49_HSal", "TE-5_EL",
"Unknown", "Utopia-1_Crp"), n = c(1L, 1L, 1L, 2L, 1L, 18L, 3L,
9L, 2L, 8L, 21L, 12L, 18L, 1L, 3L, 1L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 2L, 1L, 2L, 1L, 2L, 7L, 2L, 7L, 24L, 1L, 1L, 5L, 3L, 1L,
1L, 7L, 1L, 5L, 1L, 1L, 5L, 5L, 1L, 1L, 3L, 5L, 5L, 2L, 1L, 190L,
1L)), row.names = c(NA, -54L), class = c("tbl_df", "tbl", "data.frame"
))
It seems to me that your broader types are mostly or entirely at the beginning of the string. You could therefore use just the first alphanumeric sequence ([[:alnum:]]+) of the type as the broader type. This would give you the following types:
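library(tidyverse)
# tibs is the tibble from the question's dput() output
tibs %>%
  # keep only the first run of letters/digits, e.g. "Mariner-1_AMel" -> "Mariner"
  mutate(type_short = str_extract(type, "[[:alnum:]]+")) %>%
  # number of original types that fall into each short type
  count(type_short, sort = TRUE)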
#> # A tibble: 15 x 2
#> type_short n
#> <chr> <int>
#> 1 Mariner 12
#> 2 LINE 11
#> 3 DNA 10
#> 4 Jockey 8
#> 5 hAT 3
#> 6 AMARI 1
#> 7 Copia 1
#> 8 G3 1
#> 9 Gypsy 1
#> 10 HELITRON7 1
#> 11 LTR 1
#> 12 TE 1
#> 13 TTAAG 1
#> 14 Unknown 1
#> 15 Utopia 1
You can easily use the new column in group_by to sum the counts:
tibs %>%
  mutate(type_short = str_extract(type, "[[:alnum:]]+")) %>%
  group_by(type_short) %>%
  # add up the original counts within each short type
  summarise(n = sum(n))
#> # A tibble: 15 x 2
#> type_short n
#> <chr> <int>
#> 1 AMARI 1
#> 2 Copia 1
#> 3 DNA 94
#> 4 G3 1
#> 5 Gypsy 3
#> 6 hAT 5
#> 7 HELITRON7 1
#> 8 Jockey 10
#> 9 LINE 54
#> 10 LTR 7
#> 11 Mariner 35
#> 12 TE 1
#> 13 TTAAG 1
#> 14 Unknown 190
#> 15 Utopia 1
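As a side note, the group_by/summarise pair above can be shortened with the wt argument of count(), which sums a column instead of counting rows. This is only a minimal sketch on a few rows copied from the question; tibs_small, collapsed and the output column name total are names made up for the illustration.

```r
library(dplyr)
library(stringr)

# A few rows copied from the question's data (hypothetical subset)
tibs_small <- tibble(
  type = c("Mariner-1_AMel", "Mariner-10_HSal", "Jockey-1_DAn",
           "LINE/Jockey", "hAT-16_SM"),
  n    = c(5L, 1L, 1L, 24L, 2L)
)

collapsed <- tibs_small %>%
  mutate(type_short = str_extract(type, "[[:alnum:]]+")) %>%
  # wt = n sums the n column per group; name avoids any clash with the input column
  count(type_short, wt = n, name = "total", sort = TRUE)

collapsed
```

Note that "LINE/Jockey" ends up under LINE here, because only the start of the string is used.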
Theoretically, you could also try to use string similarity here. Yet your types do not have great similarity among themselves. A relative Levenshtein distance (distance / characters of the longer string) for example retrieves results like this:
strings <- c("Mariner-1_Amel", "Mariner-10")
adist(strings) / max(nchar(strings))
#> [,1] [,2]
#> [1,] 0.0000000 0.3571429
#> [2,] 0.3571429 0.0000000
Since this is a distance, it could be interpreted as the two types being about 36% different (roughly 64% similar). Finding a good threshold might be hard in that case.
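If you want to experiment with the similarity idea anyway, one possible sketch is to feed the relative distances into hierarchical clustering and cut the tree at some height. Everything here is an assumption for illustration: the sample of names, the average linkage, and especially the cut height of 0.5, which you would have to tune by inspecting the resulting groups.

```r
# Sample of type names from the question
types <- c("Mariner-1_AMel", "Mariner-10_HSal", "Jockey-1_DAn",
           "Jockey-12_DF", "hAT-16_SM", "hAT-N4_RPr")

# Pairwise Levenshtein distance, scaled by the longer string of each pair
len <- nchar(types)
rel_dist <- adist(types) / outer(len, len, pmax)
dimnames(rel_dist) <- list(types, types)

# Hierarchical clustering on the relative distances
hc <- hclust(as.dist(rel_dist), method = "average")

# Cut height is an arbitrary assumption, not a recommended value
clusters <- cutree(hc, h = 0.5)

# Type names grouped by cluster id
split(types, clusters)
```

Whether the groups come out sensibly depends heavily on the cut height, which is the threshold problem mentioned above.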
This solution uses the dplyr function case_when and the base R function grepl.
library(dplyr)
tibs %>%
  mutate(
    # conditions are checked in order; the first match wins
    category = case_when(
      grepl("hAT|Jockey", type) ~ "Jockey",
      grepl("Mariner", type) ~ "Mariner",
      grepl("DNA", type) ~ "DNA",
      grepl("LINE", type) ~ "LINE",
      # anything else keeps its original name
      TRUE ~ as.character(type)
    ),
    category = factor(category)
  )
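The case_when approach only labels the rows; to actually collapse the counts you still need to sum n per category. Here is a minimal sketch of that step on a handful of rows copied from the question (tibs_small and collapsed are names made up for the example). It also shows that the order of the conditions matters: "DNA/hAT-Charlie" matches the first condition before the DNA one is ever checked.

```r
library(dplyr)

# A few rows copied from the question's data (hypothetical subset)
tibs_small <- tibble(
  type = c("DNA/hAT-Charlie", "hAT-16_SM", "Jockey-1_DAn", "LINE/Jockey",
           "Mariner-1_AMel", "DNA/TcMar-Mariner", "DNA/P", "Unknown"),
  n    = c(18L, 2L, 1L, 24L, 5L, 21L, 2L, 190L)
)

collapsed <- tibs_small %>%
  mutate(category = case_when(
    grepl("hAT|Jockey", type) ~ "Jockey",   # checked first, so it wins over DNA/LINE
    grepl("Mariner", type)    ~ "Mariner",
    grepl("DNA", type)        ~ "DNA",
    grepl("LINE", type)       ~ "LINE",
    TRUE                      ~ type        # fallback: keep the original name
  )) %>%
  group_by(category) %>%
  summarise(n = sum(n))

collapsed
```

If you would rather have "DNA/hAT-Charlie" counted under DNA, move the DNA condition above the hAT|Jockey one.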
If there is no commonality to define the groups, you can define individual conditions using case_when.
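library(dplyr)
library(stringr)
tibs %>%
  mutate(category = case_when(
    str_detect(type, 'Mariner-\\d+') ~ 'Mariner',
    str_detect(type, 'Jockey|hAT') ~ 'common',
    #Add more conditions
    # rows that match none of the conditions get NA;
    # add a final TRUE ~ type if they should keep their name
  ))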
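One way to combine the two ideas in this thread, sketched here under assumptions: look for a known keyword anywhere in the name first, and fall back to the leading alphanumeric prefix when none is found. The keyword list, tibs_small, collapsed and the column name total are all hypothetical; note that this variant keeps Jockey and hAT as separate categories.

```r
library(dplyr)
library(stringr)

# A few rows copied from the question's data (hypothetical subset)
tibs_small <- tibble(
  type = c("Mariner_CA", "Mariner-10_HSal", "DNA/TcMar-Mariner",
           "LINE/I-Jockey", "hAT-N4_RPr", "LTR/Gypsy"),
  n    = c(1L, 1L, 21L, 2L, 2L, 7L)
)

# Keywords that should win wherever they appear (an assumed list)
keywords <- "Mariner|Jockey|hAT"

collapsed <- tibs_small %>%
  mutate(category = coalesce(
    str_extract(type, keywords),          # NA when no keyword is present
    str_extract(type, "[[:alnum:]]+")     # fallback: leading prefix
  )) %>%
  count(category, wt = n, name = "total")

collapsed
```

Extending the grouping is then a matter of adding names to keywords rather than writing another case_when branch.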