
Find the best curve to fit a family of curves using R

I have a process that generates a set of numbers (each < 1) on every run. The process runs until the cumulative sum of the generated numbers equals 1, so each set may contain a different count of numbers, but every set sums to 1.

There are thousands of runs of the process. When I plot the cumulative sum for each run, I get multiple curves, one per run.

For 50 runs: [image: output plot for 50 runs]

For 2000 runs: [image: output plot for 2000 runs]

As you can see, the curves have a definite shape and are not random output. I want to find the best-fit equation for this group of curves.

How can I do this in R? Most best-fit curve solutions are for fitting a single set of data.

Here is the code to generate sample data with 5 runs:

run_group <- c('A_group', 'A_group', 'A_group', 'A_group', 'A_group', 'A_group', 'A_group', 'A_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'C_group', 'C_group', 'C_group', 'C_group', 'C_group', 'C_group', 'C_group', 'D_group', 'D_group', 'D_group', 'D_group', 'D_group', 'D_group', 'D_group', 'D_group', 'D_group', 'E_group', 'E_group', 'E_group', 'E_group', 'E_group', 'E_group', 'E_group', 'E_group', 'E_group', 'E_group', 'E_group', 'E_group', 'E_group')

cumul <- c(0.052631579, 0.263157895, 0.342105263, 0.710526316, 0.868421053, 0.894736842, 0.973684211, 1, 0.0078125, 0.015625, 0.0390625, 0.0546875, 0.0703125, 0.1015625, 0.1640625, 0.3203125, 0.4921875, 0.734375, 0.875, 0.96875, 0.9921875, 1, 0.073529412, 0.220588235, 0.323529412, 0.507352941, 0.727941176, 0.970588235, 1, 0.006134969, 0.055214724, 0.141104294, 0.190184049, 0.349693252, 0.595092025, 0.858895706, 0.969325153, 1, 0.005649718, 0.011299435, 0.016949153, 0.039548023, 0.073446328, 0.124293785, 0.299435028, 0.451977401, 0.559322034, 0.728813559, 0.81920904, 0.960451977, 1)

time_diff_to_complete <- c(-155, -140, -125, -110, -95, -80, -65, -50, -270, -210, -195, -180, -165, -150, -135, -120, -105, -90, -75, -60, -45, -30, -130, -115, -100, -85, -70, -55, -40, -175, -160, -130, -115, -100, -85, -70, -55, -40, -225, -210, -195, -180, -150, -135, -120, -105, -90, -75, -60, -45, -30)

sample_data <- data.frame(run_group, cumul, time_diff_to_complete, stringsAsFactors=FALSE)

Just stack them. The curves look like Gaussian CDFs, so we fit to pnorm. (The logistic CDF, plogis, would likely also work.)
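Written out, the stacked fit amounts to the least-squares problem below. This is only a notational sketch: t stands for time_diff_to_complete, y_i for the cumul values pooled over all runs, and a and b for the Gaussian mean and standard deviation; these symbols are introduced here for illustration.

```latex
\hat{y}(t) = \Phi\!\left(\frac{t - a}{b}\right)
           = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{t - a}{b\sqrt{2}}\right)\right],
\qquad
(\hat{a}, \hat{b}) = \arg\min_{a,\,b} \sum_{i=1}^{n}
    \left( y_i - \Phi\!\left(\frac{t_i - a}{b}\right) \right)^{2}
```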

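# Stack all runs and sort by time so lines() draws one left-to-right curve
x <- sample_data$time_diff_to_complete
o <- order(x)

# Starting values: mean and sd of the stacked time values
st <- list(a = mean(x), b = sd(x))

# Least-squares fit of the Gaussian cdf to the stacked cumulative sums
fm <- nls(cumul ~ pnorm(time_diff_to_complete, a, b), sample_data[o, ], start = st)

# Plot the data points and overlay the fitted curve
plot(cumul ~ time_diff_to_complete, sample_data)
lines(fitted(fm) ~ time_diff_to_complete, sample_data[o, ])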

The fit looks like this:

[image: screenshot of the fitted curve]
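To check the remark that plogis would likely also work, one could fit both CDFs to the stacked data and compare their residual sums of squares. The following is a minimal sketch, not part of the original answer: the names fm_norm, fm_logis and rss are made up here, the run labels are omitted because stacking ignores them, and the logistic starting scale sd(x) * sqrt(3) / pi is an assumption chosen so that the starting curve has the same standard deviation as the Gaussian start.

```r
# Sample data from the question; run labels are dropped because the
# stacked fit does not use them.
sample_data <- data.frame(
  cumul = c(0.052631579, 0.263157895, 0.342105263, 0.710526316, 0.868421053,
            0.894736842, 0.973684211, 1,
            0.0078125, 0.015625, 0.0390625, 0.0546875, 0.0703125, 0.1015625,
            0.1640625, 0.3203125, 0.4921875, 0.734375, 0.875, 0.96875,
            0.9921875, 1,
            0.073529412, 0.220588235, 0.323529412, 0.507352941, 0.727941176,
            0.970588235, 1,
            0.006134969, 0.055214724, 0.141104294, 0.190184049, 0.349693252,
            0.595092025, 0.858895706, 0.969325153, 1,
            0.005649718, 0.011299435, 0.016949153, 0.039548023, 0.073446328,
            0.124293785, 0.299435028, 0.451977401, 0.559322034, 0.728813559,
            0.81920904, 0.960451977, 1),
  time_diff_to_complete = c(-155, -140, -125, -110, -95, -80, -65, -50,
            -270, -210, -195, -180, -165, -150, -135, -120, -105, -90, -75,
            -60, -45, -30,
            -130, -115, -100, -85, -70, -55, -40,
            -175, -160, -130, -115, -100, -85, -70, -55, -40,
            -225, -210, -195, -180, -150, -135, -120, -105, -90, -75, -60,
            -45, -30)
)

x <- sample_data$time_diff_to_complete

# Gaussian cdf fit, same model and starting values as in the answer
fm_norm <- nls(cumul ~ pnorm(time_diff_to_complete, a, b), sample_data,
               start = list(a = mean(x), b = sd(x)))

# Logistic cdf fit; a is the location and b the scale of plogis
# (assumed start: scale giving the same standard deviation as sd(x))
fm_logis <- nls(cumul ~ plogis(time_diff_to_complete, a, b), sample_data,
                start = list(a = mean(x), b = sd(x) * sqrt(3) / pi))

# Residual sum of squares for each fit; smaller means a closer fit
rss <- c(pnorm = deviance(fm_norm), plogis = deviance(fm_logis))
print(rss)

# Fitted parameters, which give the explicit best-fit equation for each model
print(coef(fm_norm))
print(coef(fm_logis))
```

Both models have two parameters, so the residual sums of squares are directly comparable. The coef() output can be substituted back into pnorm(t, a, b) or plogis(t, a, b) to get a single equation for the whole family of curves.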
