Is there a more efficient way to handle facts that are duplicated in an R data frame?

I have a data frame that looks like this:

ID <- c(1,1,1,2,2,2,2,3,3,3,3)
Fact <- c(233,233,233,50,50,50,50,15,15,15,15)
Overall_Category <- c("Purchaser","Purchaser","Purchaser","Car","Car","Car","Car","Car","Car","Car","Car")
Descriptor <- c("Country", "Gender", "Eyes", "Color", "Financed", "Type", "Transmission", "Color", "Financed", "Type", "Transmission")
Members <- c("America", "Male", "Brown", "Red", "Yes", "Sedan", "Manual", "Blue","No", "Van", "Automatic")

df <- data.frame(ID, Fact, Overall_Category, Descriptor, Members)

The data frame's dimensions work like this:

  • There will always be an ID/key that uniquely identifies a submitted fact.
  • There will always be a dimension for a given fact defining the Overall_Category to which the submitted fact belongs.
  • Most of the time, but not always, there will be a "Descriptor" dimension.
  • If a given fact has a "Descriptor" dimension, there will also be a "Members" dimension showing the possible members within that "Descriptor".
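The first rule above (one fact per ID) can be checked before collapsing anything. This is only a minimal sketch against the example data from the question; the names fact_check, n_rows and n_facts are made up for illustration.

```r
library(dplyr)

# Rebuild the example data from the question.
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  Fact = c(233, 233, 233, 50, 50, 50, 50, 15, 15, 15, 15),
  Overall_Category = c(rep("Purchaser", 3), rep("Car", 8)),
  Descriptor = c("Country", "Gender", "Eyes",
                 "Color", "Financed", "Type", "Transmission",
                 "Color", "Financed", "Type", "Transmission"),
  Members = c("America", "Male", "Brown",
              "Red", "Yes", "Sedan", "Manual",
              "Blue", "No", "Van", "Automatic"),
  stringsAsFactors = FALSE
)

# For each ID, count the rows and the number of distinct Fact values.
fact_check <- df %>%
  group_by(ID) %>%
  summarise(n_rows = n(), n_facts = n_distinct(Fact), .groups = "drop")

# TRUE when every ID carries exactly one distinct Fact value,
# i.e. the repeated rows really are duplicates of a single fact.
all(fact_check$n_facts == 1)
```

If this check fails for real data, collapsing by ID alone would silently merge different facts, so it is probably worth running first.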

The problem is that a single submitted fact is repeated for a given ID, once for every dimension that applies to it. What I'd like is a way to show each fact only once, keyed by its ID, with the applicable dimensions stored against that single ID.

I've achieved it by doing this:

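library(dplyr)    # %>%, select, mutate, mutate_at
library(tidyr)    # pivot_wider, unite
library(stringr)  # str_replace_all

# One column per Overall_Category-Descriptor-Members combination, prefixed with "zzzz"
df1 <- pivot_wider(df,
                   id_cols = ID,
                   names_from = c(Overall_Category, Descriptor, Members),
                   names_prefix = "zzzz",
                   values_from = Fact,
                   names_sep = "-",
                   names_repair = "unique")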

ColumnNames <- df1 %>% select(matches("zzzz")) %>% colnames()


df2 <- df1 %>% mutate(mean_sel = rowMeans(select(., ColumnNames), na.rm = T))
df3 <- df2 %>% mutate_at(ColumnNames, function(x) ifelse(!is.na(x), deparse(substitute(x)), NA))
df3 <- df3 %>% unite('Descriptor', ColumnNames, na.rm = T, sep = "_")
df3 <- df3 %>% mutate_at("Descriptor", str_replace_all, "zzzz", "")

But it seems like it wouldn't scale well for facts with many dimensions because of the pivot_wider step, and in general it doesn't seem like a very efficient approach.

Is there a better way to do this?

I think you want a simple paste() with the sep and collapse arguments:

library(dplyr, warn.conflicts = FALSE)

df %>% group_by(ID, Fact) %>%
  summarise(Descriptor = paste(paste(Overall_Category, Descriptor, Members, sep = '-'), collapse = '_'), .groups = 'drop')

# A tibble: 3 x 3
     ID  Fact Descriptor                                                            
  <dbl> <dbl> <chr>                                                                 
1     1   233 Purchaser-Country-America_Purchaser-Gender-Male_Purchaser-Eyes-Brown  
2     2    50 Car-Color-Red_Car-Financed-Yes_Car-Type-Sedan_Car-Transmission-Manual 
3     3    15 Car-Color-Blue_Car-Financed-No_Car-Type-Van_Car-Transmission-Automatic

You can unite the columns, combine them for each ID, and take the average of the Fact values.

library(dplyr)
library(tidyr)

df %>%
  unite(Descriptor, Overall_Category:Members, sep = '-', na.rm = TRUE) %>%
  group_by(ID) %>%
  summarise(Descriptor = paste0(Descriptor, collapse = '_'), 
            mean_sel = mean(Fact, na.rm = TRUE))

#     ID Descriptor                                               mean_sel
#  <dbl> <chr>                                                       <dbl>
#1     1 Purchaser-Country-America_Purchaser-Gender-Male_Purchas…      233
#2     2 Car-Color-Red_Car-Financed-Yes_Car-Type-Sedan_Car-Trans…       50
#3     3 Car-Color-Blue_Car-Financed-No_Car-Type-Van_Car-Transmi…       15
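The question notes that a fact does not always have a "Descriptor". The sketch below shows how the unite approach with na.rm = TRUE might behave in that case. It assumes the missing dimensions are stored as NA; the data frame df_na and its ID 4 row are hypothetical and not part of the original data.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: ID 4 is a fact with no Descriptor or Members (assumed to be NA).
df_na <- data.frame(
  ID = c(2, 2, 4),
  Fact = c(50, 50, 7),
  Overall_Category = c("Car", "Car", "Car"),
  Descriptor = c("Color", "Type", NA),
  Members = c("Red", "Sedan", NA),
  stringsAsFactors = FALSE
)

res <- df_na %>%
  # na.rm = TRUE drops the missing pieces before joining with "-"
  unite(Descriptor, Overall_Category:Members, sep = "-", na.rm = TRUE) %>%
  group_by(ID, Fact) %>%
  summarise(Descriptor = paste0(Descriptor, collapse = "_"), .groups = "drop")

res$Descriptor
# "Car-Color-Red_Car-Type-Sedan" "Car"
```

With na.rm = TRUE the fact without a descriptor is labelled with its category only, rather than something like "Car-NA-NA".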

An option with str_c:

library(dplyr)
library(stringr)
df %>%
   group_by(ID, Fact) %>%
   summarise(Descriptor = str_c(Overall_Category, Descriptor, Members, sep= "-", collapse="_"), .groups = 'drop')
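One difference from paste() that may matter here: str_c() propagates missing values, whereas paste() converts them to the string "NA". A small sketch, assuming a fact without a "Descriptor" is stored as NA (the variable names are illustrative only):

```r
library(stringr)

# paste() turns NA into the literal text "NA"
paste_label <- paste("Car", NA, NA, sep = "-")
paste_label
# "Car-NA-NA"

# str_c() returns NA as soon as any piece is missing
str_c_label <- str_c("Car", NA, NA, sep = "-")
is.na(str_c_label)
# TRUE
```

So with str_c() a single missing piece makes the whole collapsed Descriptor for that ID NA, which may or may not be what you want.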
