Reshape a data frame while keeping unique values in R

I have a question about removing duplicate values for a single timestamp. My data set is large, with millions of rows. Here is a sample data frame that shows the problem:

     Name     <- c('PP_1','PP_1','PP_1','PP_1','PP_1')
     category <- c('GT','GT','GT','GT','GT')
     year     <- c('2025','2025','2025','2025','2025')
     month    <- c('12','12','12','12','12')
     day      <- c('30','30','30','30','30')
     period   <- c('1','1','1','1','1')
     value    <- c('53.55','0.00','0.00','0.00','0.00')
     df <- data.frame(Name, category, year, month, day, period, value)
     df <- transform(df,
                     Name = as.character(Name), category = as.character(category),
                     year = as.integer(year), month = as.integer(month), day = as.integer(day),
                     period = as.numeric(period), value = as.numeric(value))

How can I get rid of the unwanted multiple values (here, zeros) for the same timestamp? I would like to keep the highest value, e.g. '53.55', and remove all the zeros for the same time period. The final df should look like this:

     Name     <- c('PP_1')
     category <- c('GT')
     year     <- c('2025')
     month    <- c('12')
     day      <- c('30')
     period   <- c('1')
     value    <- c('53.55')
     df <- data.frame(Name, category, year, month, day, period, value)

There are multiple Names in the data frame, with values for the entire year. When I use reshape_df <- tidyr::spread(df, Name, value), it gives me: Error: Each row of output must be identified by a unique combination of keys. Keys are shared for 1032 rows. I also tried df %>% gather(Name, year, month, day, period, value), but with no luck. Could someone help me find the correct solution? Thanks in advance.

How about subset?

subset(df, subset=!duplicated(cbind(Name, category, year, month, day, period)))
#  Name category year month day period value
#1 PP_1       GT 2025    12  30      1 53.55

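This keeps the first record for each combination of the specified variables. Note that the first record is not necessarily the one with the highest value; in the sample it just happens to be. If you must use dplyr, try filter: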

library(dplyr)
filter(df, !duplicated(cbind(Name, year, month, day, period)))

The definition of "uniqueness" depends on which variables you pass to duplicated().
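
Because duplicated() keeps the first record per key, one way to make that record the maximum is to sort by value in descending order first. The following is only a minimal base R sketch of that idea; the sample data and the names key_cols, sorted and deduped are made up for illustration and are not from the question.

```r
# Hypothetical sample: two periods, and the maximum is NOT always the first row
df <- data.frame(
  Name     = c("PP_1", "PP_1", "PP_1", "PP_1"),
  category = "GT",
  year     = 2025L,
  month    = 12L,
  day      = 30L,
  period   = c(1, 1, 2, 2),
  value    = c(0.00, 53.55, 12.30, 0.00),
  stringsAsFactors = FALSE
)

# Columns that together identify one timestamp (an assumption about the data)
key_cols <- c("Name", "category", "year", "month", "day", "period")

# Sort so that the largest value comes first within the whole data frame
sorted <- df[order(-df$value), ]

# Keep only the first (i.e. largest) row for each key combination
deduped <- sorted[!duplicated(sorted[key_cols]), ]

# Restore ordering by the key columns and reset the row names
deduped <- deduped[do.call(order, unname(as.list(deduped[key_cols]))), ]
rownames(deduped) <- NULL

deduped
```

If only the maximum per key is needed, aggregate(value ~ ., data = df, FUN = max) should give a similar result in one call, at the cost of dropping rows where any column is NA.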

You could use

library(dplyr)

df %>%
  group_by(across(-value)) %>%
  mutate(value = as.numeric(as.character(value))) %>%
  filter(value==max(value), .preserve = TRUE)

which returns

  Name  category year  month day   period value
  <fct> <fct>    <fct> <fct> <fct> <fct>  <dbl>
1 PP_1  GT       2025  12    30    1       53.6
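
One caveat with filter(value == max(value)) is that tied maxima are all kept, so a group containing only zeros would still have several rows and the reshape would fail again. A possible variant, sketched here under the assumption that dplyr >= 1.0.0 and tidyr >= 1.0.0 are installed, uses slice_max() with with_ties = FALSE and then pivot_wider(), the successor of spread(). The sample data, including the name PP_2, is hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample: PP_2 has only tied zeros for the same timestamp
df <- data.frame(
  Name     = c("PP_1", "PP_1", "PP_2", "PP_2"),
  category = "GT",
  year     = 2025L,
  month    = 12L,
  day      = 30L,
  period   = 1,
  value    = c(53.55, 0.00, 0.00, 0.00)
)

# Keep exactly one row (the largest value) per combination of the other columns
deduped <- df %>%
  group_by(across(-value)) %>%
  slice_max(value, n = 1, with_ties = FALSE) %>%
  ungroup()

# With unique keys, one column per Name can now be created
wide <- deduped %>%
  pivot_wider(names_from = Name, values_from = value)

wide
```

With this sample, wide should contain a single row with one column each for PP_1 and PP_2.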

