I have a question: how can I remove duplicate values for a single timestamp? I have a large dataset with millions of rows. This is what a sample data frame showing the problem looks like:
Name <-c('PP_1','PP_1','PP_1','PP_1','PP_1')
category<-c('GT','GT','GT','GT','GT')
year<-c('2025','2025','2025','2025','2025')
month<-c('12','12','12','12','12')
day <-c('30','30','30','30','30')
period<-c('1','1','1','1','1')
value<-c('53.55','0.00','0.00','0.00','0.00')
df<-data.frame(Name,category,year,month,day,period,value)
df<-transform(df, Name = as.character(Name),category = as.character(category),year = as.integer(year),
month = as.integer(month),day = as.integer(day),period = as.numeric(period),value = as.numeric(value))
How can I get rid of these unwanted multiple values (here, zeros) for the same timestamp? I would like to keep the highest value, e.g. 53.55, and remove all the zeros for the same time period. The final df should look like this:
Name <-c('PP_1')
category<-c('GT')
year<-c('2025')
month<-c('12')
day <-c('30')
period<-c('1')
value<-c('53.55')
df<-data.frame(Name,category,year,month,day,period,value)
There are multiple Names in the data frame, with values for the entire year. When I use reshape_df<- tidyr::spread(df,Name,value) it gives me:
Error: Each row of output must be identified by a unique combination of keys. Keys are shared for 1032 rows
I also tried df%>% gather(Name,year, month, day, period, value) but had no luck. Could someone help me find the correct solution? Thanks in advance.
How about subset?
subset(df, subset=!duplicated(cbind(Name, category, year, month, day, period)))
# Name category year month day period value
#1 PP_1 GT 2025 12 30 1 53.55
This keeps the first record of each combination of the variables specified, so it returns 53.55 here only because that row comes first in the sample. If you must use dplyr, then try filter:
library(dplyr)
filter(df, !duplicated(cbind(Name, year, month, day, period)))
The definition of "uniqueness" will depend on what variables you put in the filter.
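Since duplicated() keeps the first row per key, one way to make sure that first row holds the highest value is to sort by value in descending order before dropping duplicates. A minimal base R sketch, assuming value is already numeric. The sample is the question's data with the 53.55 row deliberately moved out of first position, and key_cols, df_sorted and df_max are names made up for the illustration:

```r
# Question's sample, but with 53.55 in the second row (assumption for the demo)
df <- data.frame(
  Name = "PP_1", category = "GT",
  year = 2025L, month = 12L, day = 30L, period = 1,
  value = c(0.00, 53.55, 0.00, 0.00, 0.00),
  stringsAsFactors = FALSE
)

# Columns that together identify one timestamp for one Name
key_cols <- c("Name", "category", "year", "month", "day", "period")

# Sort so the largest value comes first
df_sorted <- df[order(df$value, decreasing = TRUE), ]

# Keep the first row per key, which is now the row with the highest value
df_max <- df_sorted[!duplicated(df_sorted[key_cols]), ]
print(df_max)
```

Sorting millions of rows has a cost, so this is probably only worth it if the maximum is not guaranteed to come first in the raw data.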
You could use
library(dplyr)
df %>%
group_by(across(-value)) %>%
mutate(value = as.numeric(as.character(value))) %>%
filter(value==max(value), .preserve = TRUE)
which returns
Name category year month day period value
<fct> <fct> <fct> <fct> <fct> <fct> <dbl>
1 PP_1 GT 2025 12 30 1 53.6
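One caveat with filter(value == max(value)): tied rows are all kept, so a timestamp whose values are all zero would still have duplicate keys and the reshape would presumably fail again. A sketch of one way around that, assuming dplyr >= 1.0.0 and tidyr >= 1.0.0, using slice_max() with with_ties = FALSE and then pivot_wider(), the newer replacement for spread(). The PP_2 rows are hypothetical data added only to show the tie case:

```r
library(dplyr)
library(tidyr)

# PP_1 rows follow the question; PP_2 rows are made up to show an all-zero tie
df <- data.frame(
  Name = c("PP_1", "PP_1", "PP_1", "PP_2", "PP_2"),
  category = "GT", year = 2025L, month = 12L, day = 30L, period = 1,
  value = c(53.55, 0, 0, 0, 0)
)

# Exactly one row per key: the highest value, with ties broken arbitrarily
deduped <- df %>%
  group_by(across(-value)) %>%
  slice_max(value, n = 1, with_ties = FALSE) %>%
  ungroup()

# Keys are now unique, so each Name can become its own column
wide <- pivot_wider(deduped, names_from = Name, values_from = value)
print(wide)
```

tidyr::spread(deduped, Name, value) should work on the deduplicated data as well, since the error in the question comes from the repeated keys rather than from spread itself.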