简体   繁体   中英

R: optimize finding max values of function on a data frame and then trim the rest

First of all my data comes from Temperature.xls which can be downloaded from this link: RBook

My code is this:

temp = read.table("Temperature.txt", header = TRUE)
length(unique(temp$Year)) # number of unique values in the Year vector.
res = ddply(temp, c("Year","Month"), summarise, Mean = mean(Temperature, na.rm = TRUE))
res1 = ddply(temp, .(Year,Month), summarise,
    SD = sd(Temperature, na.rm = TRUE),
    N = sum(!is.na(Temperature))
         )
# ordering res1 by sd and year:
res1 = res1[order(res1$Year,res1$SD),];
# finding maximum of SD in res1 by year and displaying just them in a separate data frame
res1_maxsd = ddply(res1, .(Year), summarise, MaxSD = max(SD, na.rm = TRUE)) # find the maxSD in each Year
res1_max = merge(res1_maxsd,res1, all = FALSE) # merge it with the original to see other variables at the max's rows
res1_m = res1_max[res1_max$MaxSD==res1_max$SD,] # find which rows are the ones corresponding to the max value
res1_mm = res1_m[complete.cases(res1_m),] # trim all others (which are NA's)

I know that I can cut the 4 last lines to less lines. Can I somehow execute the last 2 lines in one command? I have stumbled across:

res1_m = res1_max[complete.cases(res1_max$MaxSD==res1_max$SD),]

But this does not give me what I want which is eventually a smaller data frame only with the rows (with all the variables) that contain the maxSD.

Rather than fixing the last 2 lines why not start with res1? Reversing the order of SD and taking the first row per year gives you an equivalent final data set...

res1 <- res1[order(res1$Year,-res1$SD),]
res_final <- res1[!duplicated(res1$Year),]

The last four lines can be cut down if you use dplyr package. Since you want to keep some information from original data set, you probably don't want to use summarise because it only returns summarized information and you have to merge with original dataset, so mutate and filter would be a better choice:

library(dplyr)
res1_mm1 <- res1 %>% group_by(Year) %>% filter(SD == max(SD, na.rm = T))

You can also use a mutate function to create the new column MaxSD which is the same as SD in the result data frame for your case:

res1_mm1 <- res1 %>% group_by(Year) %>% mutate(MaxSD = max(SD, na.rm = T)) %>% 
            filter(SD == MaxSD)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM