
Conditional stat_summary for ggplot in R

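I would like to write some conditional statistics onto my plot when the data exceed a certain value.

With the help of Jack Ryan (Cut data and access groups to draw percentile lines), I was able to create the following script, which groups the data into hours and plots the result: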

# Read example data
A <- read.csv(url('http://people.ee.ethz.ch/~hoferr/download/data-20130812.csv'))

# Libraries
library(doBy)
library(ggplot2)
library(plyr)
library(reshape2)
library(MASS)
library(scales)

# Sample size function
give.n <- function(x){
       return(c(y = min(x) - 0.2, label = length(x)))
}

# Calculate gaps
gaps <- rep(NA, length(A$Timestamp))
times <- A$Timestamp
loss <- A$pingLoss
gap.start <- 1
gap.end <- 1
for(i in 2:length(A$Timestamp))
{ #For all rows
    if(is.na(A$pingRTT.ms.[i]))
    { #Currently no connection
        if(!is.na(A$pingRTT.ms.[i-1]))
        { #Connection lost now
            gap.start <- i
        }
        if(!is.na(A$pingRTT.ms.[i+1]))
        { # Connection restores next time
            gap.end <- i+1
            gaps[gap.start] <- as.numeric(A$Timestamp[gap.end]-A$Timestamp[gap.start], units="secs")
            loss[gap.start] <- gap.end - gap.start
        } 
    }       
}              
H <- data.frame(times, gaps, loss)
H <- H[complete.cases(H),]
C <- H      
C$dates <- strptime(C$times, "%Y-%m-%d %H:%M:%S")
C$h1 <- C$dates$hour

# Calculate percentiles
cuts <- c(1, .75, .5, .25, 0)
c <- ddply(C, .(h1), function (x) { summarise(x, y = quantile(x$gaps, cuts)) } )
c$cuts <- cuts
c <- dcast(c, h1 ~ cuts, value.var = "y")
c.h1.melt <- melt(c, id.vars = "h1")

# Plot grouped by hour, labelling the longest gap in each hour
p <- ggplot(c.h1.melt, aes(x = h1, y = value, color = variable)) +
geom_point(size = 4) +
stat_summary(fun.data = max.n, geom = "text", fun.y = max, colour = "red", angle = 90, size=4) +
scale_colour_brewer(palette="RdYlBu", name="Percentile", guide = guide_legend(reverse=TRUE)) +
scale_x_continuous(breaks=0:23, limits = c(0,23)) +
annotation_logticks(sides = "lr") +
theme_bw() +
scale_y_log10(breaks=c(1e0,1e1,1e2,1e3,1e4), labels = trans_format("log10", math_format(10^.x)), limits=c(1e0,1e4)) +
xlab("Hour of day") + ylab("Ping gaps [s]")
p

# Plot grouped by minute of day
p <- ggplot(c.m1.melt, aes(x = m1/60, y = value, color = variable)) +
geom_point(size = 1) +
stat_summary(fun.data = give.n, geom = "text", fun.y = median, angle = 90, size=4) +
stat_summary(fun.data = max.n, geom = "text", fun.y = max, colour = "red", angle = 90, size=4) +
scale_colour_brewer(palette="RdYlBu", name="Percentile", guide = guide_legend(reverse=TRUE)) +
scale_x_continuous(breaks=0:23, limits = c(0,24)) +
annotation_logticks(sides = "lr") +
theme_bw() +
scale_y_log10(breaks=c(1e0,1e1,1e2,1e3,1e4), labels = trans_format("log10", math_format(10^.x)), limits=c(1e0,1e4)) +
xlab("Time of day") + ylab("Ping gaps [s]")
p

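This creates a plot of the gaps grouped by hour, with the length of the longest gap written next to the data point:

[Figure: plot grouped by hour, without the number of samples per group]

Below is the plot grouped by minute. The numbers are unreadable, which is why I want to add conditional statistics (only if the gap is longer than 5 minutes, or only for the ten longest gaps, or something similar).

[Figure: plot grouped by minute, with statistics]

I tried to change the stat function to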

max.n.filt <- function(x){
    filter = 300
    if ( x > filter ) {
      return(c(y = max(x) + 0.4, label = round(max(10^x),2)))
    } else {
        return(c(y=x, label = ""))
    }
}

and to use it for the plot grouped by minute. But I got this error:

Error in list_to_dataframe(res, attr(.data, "split_labels")) : 
  Results do not have equal lengths
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Error in if (nrow(layer_data) == 0) return() : argument is of length zero
Calls: print ... print.ggplot -> ggplot_gtable -> Map -> mapply -> <Anonymous>
In addition: Warning message:
Removed 6 rows containing missing values (geom_point).

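In addition, in the hourly plot I would like to write the number of samples per hour next to the gap length. I think I could add a new column to the c data frame, but unfortunately I could not find a way to do it.

Any help is much appreciated.

See ?stat_summary.

fun.data: Complete summary function. Should take data frame as input and return data frame as output.

Your function max.n.filt uses an if() statement that tries to evaluate the condition x > filter. However, when length(x) > 1, the if() statement only evaluates the condition for the first value of x (with a warning in older versions of R; since R 4.2.0 this is an error). When used on a data frame, this returns a list cobbled together from the original input x and whatever label the if() statement returns.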

> max.n.filt(data.frame(x=c(10,15,400)))
$y.x
[1]  10  15 400

$label
[1] ""

Try a function that uses ifelse() instead:

max.n.filt2 <- function(x){
    filter = 300                  # whatever threshold
    y = ifelse( x > filter, max(x) + 1, x[,1] )
    label = ifelse( x > filter, round(max(x),2), NA )
    return(data.frame(y=y[,1], label=label[,1]))
}

> max.n.filt2(data.frame(x=c(10,15,400)))
    y label
1  10    NA
2  15    NA
3 401   400
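
For use inside stat_summary() itself, it may be simpler to return a single row per group. As far as I know, current ggplot2 versions call fun.data with the y values of one group as a plain numeric vector, so a rough sketch could look like the one below. The helper label_max_if_large, its threshold argument, and the toy data are made up for illustration and are not part of the question's script.

```r
library(ggplot2)

# Hypothetical helper: place a label at the group maximum, but only show
# text when that maximum exceeds `threshold`.
# Assumption: stat_summary() passes the y values of one group as a vector.
label_max_if_large <- function(y, threshold = 300) {
  top <- max(y)                      # a single number, so if() is safe here
  data.frame(
    y = top,
    label = if (top > threshold) as.character(round(top, 2)) else ""
  )
}

# Made-up data: three groups, only group "c" exceeds the threshold.
toy <- data.frame(
  g = rep(c("a", "b", "c"), each = 3),
  v = c(10, 20, 30, 50, 100, 150, 200, 350, 400)
)

p <- ggplot(toy, aes(x = g, y = v)) +
  geom_point() +
  stat_summary(fun.data = label_max_if_large, geom = "text", vjust = -0.5)

# Labels that the text layer would draw, one per group.
labels <- as.character(layer_data(p, 2)$label)
```

Note that with scale_y_log10() the summary function sees the log10-transformed values, so the threshold would have to be given on that scale (which is presumably why the function in the question uses 10^x for its label).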

Alternatively, you may find it easier to use geom_text(). I could not reproduce your example, but here is a simulated data set:

set.seed(101)
sim_data <- expand.grid(m1=1:1440, variable=factor(c(0,0.25,0.5,0.75,1)))
sim_data$sample_size <- sapply(1:1440, function(.) sample(1:25, 1, replace=T))
sim_data$value = t(sapply(1:1440, function(.) quantile(rgamma(sim_data$sample_size, 0.9, 0.5),c(0,0.25,0.5,0.75,1))))[1:(1440*5)]

Simply use the subset argument of geom_text() to select the points to label:

ggplot(sim_data, aes(x = m1/60, y = value, color = variable)) +
geom_point(size = 4) + geom_text(aes(label=round(value)), subset = .(variable == 1 & value > 25), angle = 90, size = 4, colour = "red", hjust = -0.5)
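
To my knowledge, newer ggplot2 releases (2.0.0 and later) no longer accept a subset argument in layers. A roughly equivalent sketch is to filter the rows beforehand and hand them to the layer through its data argument. The small toy data frame below is made up and only stands in for sim_data.

```r
library(ggplot2)

# Made-up data standing in for sim_data: two percentiles at three times.
toy <- data.frame(
  m1 = rep(c(60, 120, 180), times = 2),
  variable = factor(rep(c(0.5, 1), each = 3)),
  value = c(5, 8, 6, 20, 30, 40)
)

# Keep only the rows that should receive a label.
to_label <- subset(toy, variable == 1 & value > 25)

p <- ggplot(toy, aes(x = m1 / 60, y = value, color = variable)) +
  geom_point(size = 4) +
  # The text layer gets its own, filtered data; the aesthetics are inherited.
  geom_text(data = to_label, aes(label = round(value)),
            angle = 90, size = 4, colour = "red", hjust = -0.5)
```

The same filtered data frame can be reused for other annotation layers, which keeps the labelling condition in one place.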

If you have a column of sample sizes, you can use paste() to merge it into the label:

ggplot(sim_data, aes(x = m1/60, y = value, color = variable)) +
geom_point(size = 4) + geom_text(aes(label=paste(round(value),", N=",sample_size)), subset = .(variable == 1 & value > 25), angle = 90, size = 4, colour = "red", hjust = -0.25)

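(Or create a separate column in the data with the labels you want.) If you are asking how to retrieve the sample size, you can modify your call to ddply() as follows: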

...
c2 <- ddply(C, .(h1), function (x) { cbind(summarise(x, y = quantile(x$gaps, cuts)), n=nrow(x)) } )
c2$cuts <- cuts
c2 <- dcast(c2, h1 + n ~ cuts, value.var = "y")
c2.h1.melt <- melt(c2, id.vars = c("h1","n"))
...
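
The idea of a separate label column that carries the sample size can also be sketched in base R. This is only an illustration on made-up data; the names toy, per_hour, max_gap, and label are assumptions and do not appear in the original script.

```r
# Made-up gap data: hour of day and gap length in seconds.
toy <- data.frame(
  h1   = c(0, 0, 0, 1, 1, 2),
  gaps = c(5, 12, 400, 7, 9, 30)
)

# Longest gap and number of samples for each hour.
max_gap <- tapply(toy$gaps, toy$h1, max)
n       <- tapply(toy$gaps, toy$h1, length)

per_hour <- data.frame(
  h1      = as.numeric(names(max_gap)),
  max_gap = as.vector(max_gap),
  n       = as.vector(n)
)

# Ready-made label column, e.g. for geom_text(aes(label = label)).
per_hour$label <- paste0(per_hour$max_gap, ", N=", per_hour$n)
```

Building the label in the data rather than inside aes() makes it easy to blank out labels for groups that should stay unlabelled.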
