How to categorize according to a list of named vectors (~ontology)
Simply put, I have a data frame in which each row contains an item and its type:
df <- data.frame(
item = 1:5,
type = c("apple", "orange", "onion", "lettuce", "chicken")
)
I want to assign each item to a higher-level category, determined by its type, according to a list of the possible types for each category. I know all the possible types (or can extract them with df$type %>% levels()).
1) How should I structure the "ontology"/"dictionary" listing all possible values for each category? I thought about a list of named lists, but I am not sure what the best way to do that would be.
ontology = c(
"fruit" = c("apple", "orange", "banana"),
"vegetable" = c("onion", "lettuce", "tomato"),
"meat" = c("chicken", "beef")
)
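One detail that bears on the choice of structure: c() flattens its arguments into a single atomic vector and appends a running digit to each name, whereas list() keeps one element per category. A minimal sketch of the difference, using hypothetical object names (as_vector, as_list) and a shortened ontology:

```r
# Hypothetical names, used only for this illustration
as_vector <- c(
  fruit = c("apple", "orange", "banana"),
  meat  = c("chicken", "beef")
)
as_list <- list(
  fruit = c("apple", "orange", "banana"),
  meat  = c("chicken", "beef")
)

# c() yields one flat character vector; category names get a numeric suffix
names(as_vector)
# list() keeps each category as its own element with a clean name
names(as_list)
as_list$fruit
```

With the list form, the types of one category can be pulled out directly (as_list$fruit), which is not possible once everything has been flattened.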
2) How should I create a variable category in my data frame that categorizes each type?
# Basic attempt...
df %>%
mutate(category = str_match(type %in% ontology))
Expected result:
df
# item type category
# 1 apple fruit
# 2 orange fruit
# 3 onion vegetable
# 4 lettuce vegetable
# 5 chicken meat
Here is a base R method using match, unlist and sub.
# flatten ontology list to named atomic vector where name is category with added digit
flat <- unlist(ontology)
# match position of df$type in flat ontology, pull out name, and remove numeric digit
df$category <- sub("\\d+$", "", names(flat)[match(df$type, flat)])
df
item type category
1 1 apple fruit
2 2 orange fruit
3 3 onion vegetable
4 4 lettuce vegetable
5 5 chicken meat
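If the ontology is stored as a list rather than built with c(), the digit-stripping step can be avoided. The following is a sketch of one possible base R variant, not the answerer's method: it assumes a list named ontology_list (a hypothetical name) and uses stack() to get a long table with one row per type.

```r
# Hypothetical list version of the ontology
ontology_list <- list(
  fruit     = c("apple", "orange", "banana"),
  vegetable = c("onion", "lettuce", "tomato"),
  meat      = c("chicken", "beef")
)

df <- data.frame(
  item = 1:5,
  type = c("apple", "orange", "onion", "lettuce", "chicken"),
  stringsAsFactors = FALSE
)

# stack() returns a data frame with columns `values` (the types)
# and `ind` (the list names, i.e. the categories, as a factor)
long <- stack(ontology_list)

# look up each type's row in the long table and take its category
df$category <- as.character(long$ind[match(df$type, long$values)])
df
```

Types that do not appear in the ontology come back as NA, since match() returns NA for them.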
You could turn ontology into a lookup table:
library(tidyverse)
df <- data.frame(
item = 1:5,
type = c("apple", "orange", "onion", "lettuce", "chicken")
)
lookup <- list( # use list to avoid suffixes on names
"fruit" = c("apple", "orange", "banana"),
"vegetable" = c("onion", "lettuce", "tomato"),
"meat" = c("chicken", "beef")
) %>%
imap(~set_names(rep(.y, length(.x)), .x)) %>% # swap names and values: each type becomes a name, its category the value
flatten_chr() # simplify to character vector
lookup
#> apple orange banana onion lettuce tomato
#> "fruit" "fruit" "fruit" "vegetable" "vegetable" "vegetable"
#> chicken beef
#> "meat" "meat"
which makes categorizing just a matter of subsetting:
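# index by name; as.character() guards against type being a factor,
# which would otherwise be used as integer codes and pick the wrong entries
df %>% mutate(category = unname(lookup[as.character(type)]))
#> item type category
#> 1 1 apple fruit
#> 2 2 orange fruit
#> 3 3 onion vegetable
#> 4 4 lettuce vegetable
#> 5 5 chicken meat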
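A point worth checking when subsetting a named vector this way: if type is a factor (the default for data.frame() before R 4.0.0), lookup[type] indexes by the factor's integer codes rather than by name, which silently assigns the wrong categories. A minimal sketch of the difference, with the lookup vector written out by hand and hypothetical variable names (type_chr, type_fct):

```r
# Lookup vector written out by hand for illustration: names are types, values are categories
lookup <- c(
  apple = "fruit", orange = "fruit", banana = "fruit",
  onion = "vegetable", lettuce = "vegetable", tomato = "vegetable",
  chicken = "meat", beef = "meat"
)

type_chr <- c("apple", "orange", "onion", "lettuce", "chicken")
type_fct <- factor(type_chr)  # levels are sorted alphabetically

# character index: elements are matched by name
by_name <- unname(lookup[type_chr])

# factor index: the integer codes (1, 5, 4, 3, 2) are used as positions
by_code <- unname(lookup[type_fct])

# converting to character first restores lookup by name
safe <- unname(lookup[as.character(type_fct)])

by_name
by_code
safe
```

Wrapping the index in as.character() costs nothing when type is already a character column, so it is a reasonable default.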