
Using R - reshape a dataframe based on group max values of another dataframe

I am working with a very large dataset. Consider the following example for illustration:

df1 <- data.frame(MyID = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 5),
                  v1 = c(0.1, 0.2, NA, 0.4, 0.2, 0.1, 0.8, 0.3, 0.1, 0.4, 0.3),
                  v2 = c(NA, 0.4, 0.2, 0.1, 0.8, 0.3, 0.1, 0.4, 0.3, 0.1, 0.2))

df2 <- data.frame(MyID = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 5),
                  v1 = c(10, 8, 0, 6, 10, 5, 3, 1, 10, 8, 3),
                  v2 = c(0, 10, 5, 1, 8, 5, 10, 3, 3, 1, 5))

I would like to extract values from df1, selected according to where the maximum occurs within each MyID group of df2. The final result should be a dataframe with:

  • one row per unique MyID
  • in each column, the df1 value at the position where the same column of df2 reaches its maximum within that MyID group.
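To make the rule concrete for a single group, here is a small sketch for the v2 column of MyID 1. The two vectors are copied by hand from df1 and df2 in row order; the variable names are only for illustration.

```r
# v2 values for MyID == 1, in row order (rows 1, 4 and 7 of each dataframe)
df1_v2_id1 <- c(NA, 0.1, 0.1)
df2_v2_id1 <- c(0, 1, 10)

# df2's v2 peaks at the third row of the group ...
pos <- which.max(df2_v2_id1)
pos
#> [1] 3

# ... so the third df1 value of the group is the one to keep
df1_v2_id1[pos]
#> [1] 0.1
```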

The result should be:

ExpectedResult <- data.frame(MyID = c(1, 2, 3, 4, 5),
                             v1 = c(0.1, 0.2, 0.1, 0.4, 0.3),
                             v2 = c(0.1, 0.4, 0.2, 0.1, 0.2))

What I have tried so far, which solved only part of the problem:

  • grouping and finding the max per group, e.g. df2Max <- df2 %>% group_by(MyID) %>% slice_max(1,)
  • splitting the data, e.g. df2.split <- split(df2, list(df2$MyID))

However, I am still not sure how to link the two dataframes to extract what I need.

We can group by MyID, get the index of the maximum value in each column within each group, and store the result in df3.

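library(dplyr)

# Position of the maximum within each MyID group, for every non-grouping column.
# across() needs an explicit column selection in dplyr >= 1.1.0.
df3 <- df2 %>%
  group_by(MyID) %>%
  summarise(across(everything(), which.max))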
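One detail worth keeping in mind is how which.max() treats ties. In the example data, the v2 values of df2 for MyID 3 are 5, 5 and 3, so the maximum occurs twice. A minimal base R illustration (the vector is copied by hand from df2):

```r
# v2 values of df2 for MyID == 3, in row order
v2_id3 <- c(5, 5, 3)

# which.max() returns the position of the first maximum only
which.max(v2_id3)
#> [1] 1

# All tied positions, if they are needed instead
which(v2_id3 == max(v2_id3))
#> [1] 1 2
```

With the first position taken, the df1 value picked for MyID 3 in v2 is 0.2, which matches the expected result in the question.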

We split df3 by row and df1 by MyID, then extract the relevant values using matrix indexing.

df3[-1] <- t(mapply(function(x, y) x[cbind(y, 1:ncol(x))], 
            split(df1[-1], df1$MyID), asplit(df3[-1], 1)))

#   MyID    v1    v2
#  <dbl> <dbl> <dbl>
#1     1   0.1   0.1
#2     2   0.2   0.4
#3     3   0.1   0.2
#4     4   0.4   0.1
#5     5   0.3   0.2
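As a cross-check that does not depend on dplyr, the same selection can be sketched in base R. This is not the method used above, just an alternative under one assumption: df1 and df2 have the same rows in the same order, as in the example data. The helper pick_at_max() and the variable value_cols are hypothetical names introduced here.

```r
df1 <- data.frame(MyID = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 5),
                  v1 = c(0.1, 0.2, NA, 0.4, 0.2, 0.1, 0.8, 0.3, 0.1, 0.4, 0.3),
                  v2 = c(NA, 0.4, 0.2, 0.1, 0.8, 0.3, 0.1, 0.4, 0.3, 0.1, 0.2))
df2 <- data.frame(MyID = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 5),
                  v1 = c(10, 8, 0, 6, 10, 5, 3, 1, 10, 8, 3),
                  v2 = c(0, 10, 5, 1, 8, 5, 10, 3, 3, 1, 5))

# Every column except the grouping key
value_cols <- setdiff(names(df1), "MyID")

# Hypothetical helper: for one MyID, take the df1 value at the position
# where the same column of df2 is largest within that group.
# Assumes df1 and df2 are row-aligned.
pick_at_max <- function(id) {
  rows <- which(df1$MyID == id)
  vapply(value_cols,
         function(col) df1[[col]][rows][which.max(df2[[col]][rows])],
         numeric(1))
}

ids <- sort(unique(df1$MyID))

# vapply() returns one column per MyID, so transpose to one row per MyID
res <- data.frame(MyID = ids,
                  t(vapply(ids, pick_at_max, numeric(length(value_cols)))))
res
```

Because value_cols is derived from the column names, the sketch should carry over to more than two value columns, although for a very large dataset the grouped approaches are likely to be faster.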

We get, within each 'MyID' group of 'df2', the row index at which 'v1' and 'v2' are highest, then join with the first dataset by 'MyID' and, after grouping by 'MyID' again, summarise the 'v1' and 'v2' columns by picking the values at those indices.

library(dplyr)
df2 %>% 
   group_by(MyID) %>% 
   summarise(rnv1 = row_number()[which.max(v1)], 
             rnv2 = row_number()[which.max(v2)], .groups = 'drop' ) %>%  
   right_join(df1, by = 'MyID') %>%
   group_by(MyID) %>% 
   summarise(v1 = v1[first(rnv1)], v2 = v2[first(rnv2)], .groups = 'drop')

Output:

# A tibble: 5 x 3
#   MyID    v1    v2
#  <dbl> <dbl> <dbl>
#1     1   0.1   0.1
#2     2   0.2   0.4
#3     3   0.1   0.2
#4     4   0.4   0.1
#5     5   0.3   0.2

Another option is a join with data.table:

library(data.table)    
nm1 <- names(df2)[-1]
setDT(df1)[setDT(df2)[, lapply(.SD, which.max), MyID], 
    Map(function(x, y) x[first(y)], .SD, mget(paste0("i.", nm1))), 
    on = .(MyID), by = .EACHI]
#   MyID  v1  v2
#1:    1 0.1 0.1
#2:    2 0.2 0.4
#3:    3 0.1 0.2
#4:    4 0.4 0.1
#5:    5 0.3 0.2
