
How to categorize according to a list of named vectors (~ontology)

Simply put, I have a data frame in which each row contains an item type:

df <- data.frame(
  item = 1:5,
  type = c("apple", "orange", "onion", "lettuce", "chicken")
)

I want to assign each item to a hierarchically higher category. The category is determined by the item's type, according to a list of possible types for each category. I know all the possible types (or can extract them with df$type %>% levels() ).

1) How should I structure the "ontology"/"dictionary" listing all possible values for each category? I thought about a list of named lists, but I am not sure what the best way to do that would be.

ontology = c(
  "fruit" = c("apple", "orange", "banana"),
  "vegetable" = c("onion", "lettuce", "tomato"),
  "meat" = c("chicken", "beef")
)

2) How should I create a variable category in my data frame that categorizes each type?

# Basic attempt...
df %>%
  mutate(category = str_match(type %in% ontology))

Expected result:

df
# item    type  category
#    1   apple     fruit
#    2  orange     fruit
#    3   onion vegetable
#    4 lettuce vegetable
#    5 chicken      meat

Here is a base R method with match, unlist and sub.

# flatten ontology list to named atomic vector where name is category with added digit
flat <- unlist(ontology)
# match position of df$type in flat ontology, pull out name, and remove numeric digit
df$category <- sub("\\d+$", "", names(flat)[match(df$type, flat)])
df
  item    type  category
1    1   apple     fruit
2    2  orange     fruit
3    3   onion vegetable
4    4 lettuce vegetable
5    5 chicken      meat
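
To make the flattening step above concrete, here is a minimal sketch showing the suffixed names that unlist produces, plus a variant that skips the regex by repeating each category name once per type. It assumes ontology is stored as a named list of character vectors; the helper names flat_types and categories are only illustrative. The variant may be preferable if a category name could itself end in a digit, since sub("\\d+$", "", ...) would strip that too.

```r
# Assumed inputs, mirroring the question (ontology stored as a named list)
ontology <- list(
  fruit = c("apple", "orange", "banana"),
  vegetable = c("onion", "lettuce", "tomato"),
  meat = c("chicken", "beef")
)
df <- data.frame(
  item = 1:5,
  type = c("apple", "orange", "onion", "lettuce", "chicken"),
  stringsAsFactors = FALSE
)

# unlist() appends a running index to each category name
suffixed <- names(unlist(ontology))
# "fruit1" "fruit2" "fruit3" "vegetable1" "vegetable2" "vegetable3" "meat1" "meat2"

# Variant without regex: one category label per type, in the same order
flat_types <- unlist(ontology, use.names = FALSE)
categories <- rep(names(ontology), lengths(ontology))

# Look up each type's position and pull the matching category
df$category <- categories[match(df$type, flat_types)]
```

Types that do not appear anywhere in ontology get NA as their category, because match returns NA for them.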

You could turn ontology into a lookup table:

library(tidyverse)

df <- data.frame(
  item = 1:5,
  type = c("apple", "orange", "onion", "lettuce", "chicken")
)

lookup <- list(    # use list to avoid suffixes on names
    "fruit" = c("apple", "orange", "banana"),
    "vegetable" = c("onion", "lettuce", "tomato"),
    "meat" = c("chicken", "beef")
) %>% 
    imap(~set_names(rep_along(.x, .y), .x)) %>%    # reverse names and objects
    flatten_chr()    # simplify to character vector

lookup
#>       apple      orange      banana       onion     lettuce      tomato 
#>     "fruit"     "fruit"     "fruit" "vegetable" "vegetable" "vegetable" 
#>     chicken        beef 
#>      "meat"      "meat"

which makes categorizing just a matter of subsetting:

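# index by name, not by factor code: as.character() guards against type being a factor
df %>% mutate(category = unname(lookup[as.character(type)]))
#>   item    type  category
#> 1    1   apple     fruit
#> 2    2  orange     fruit
#> 3    3   onion vegetable
#> 4    4 lettuce vegetable
#> 5    5 chicken      meat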
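
One thing to watch with a named-vector lookup like this: if type is stored as a factor (the default for data.frame() before R 4.0.0), indexing with it uses the factor's integer codes rather than its labels, which silently returns the wrong categories. The sketch below illustrates the difference. It builds an equivalent lookup in base R with setNames, which is an assumption made here to keep the example free of package dependencies, not the answer's own construction.

```r
ontology <- list(
  fruit = c("apple", "orange", "banana"),
  vegetable = c("onion", "lettuce", "tomato"),
  meat = c("chicken", "beef")
)

# Base R equivalent of the lookup: values are categories, names are types
lookup <- setNames(
  rep(names(ontology), lengths(ontology)),
  unlist(ontology, use.names = FALSE)
)

# A factor column, as data.frame() created by default before R 4.0.0
type <- factor(c("apple", "orange", "onion", "lettuce", "chicken"))

# Indexing with a factor uses its integer codes (levels are sorted alphabetically),
# so this picks elements by position and gives wrong categories
by_code <- unname(lookup[type])

# Converting to character indexes by name, which is what we want
by_name <- unname(lookup[as.character(type)])
```

Here by_code comes out as fruit, vegetable, vegetable, fruit, fruit, while by_name gives the intended fruit, fruit, vegetable, vegetable, meat.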
