简体   繁体   中英

How to get Q-Q plot for Pareto distribution in Python?

QQ plots are used to get the goodness of fit between a set of data points and theoretical distribution. Following is the procedure to get the points.

  1. Select the samples to use. Sort the selected samples with X(i) denoting the ith sample
  2. Find the model values that correspond to the samples. This is done in two steps,

    a. Associate each sample with the percentile it represents. pi = (i-0.5)/n

    b. Calculate the model value that would be associated with this percentile. This is done by inverting the model CDF, as is done when generating random variates from the model distribution. Thus the model value corresponding to sample i is Finverse(pi).

    c. Draw the QQ plot, using the n points

( X(i), Finverse(pi)) 1 ≤ i ≤ n

Using this approach I came up with the following python implementation.

_distn_names = ["pareto"]
def fit_to_all_distributions(data):
    dist_names = _distn_names

    params = {}
    for dist_name in dist_names:
        try:
            dist = getattr(st, dist_name)
            param = dist.fit(data)

            params[dist_name] = param
        except Exception:
            print("Error occurred in fitting")
            params[dist_name] = "Error"

    return params 

def get_q_q_plot(values, dist, params):
    values.sort()

    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    x = []

    for i in range(len(values)):
        x.append((i-0.5)/len(values))

    y = getattr(st, dist).ppf(x, loc=loc, scale=scale, *arg)

    y = list(y)

    emp_percentiles = values
    dist_percentiles = y

    print("Emperical Percentiles")
    print(emp_percentiles)

    print("Distribution Percentiles")
    print(dist_percentiles)

    plt.figure()
    plt.xlabel('dist_percentiles')
    plt.ylabel('actual_percentiles')
    plt.title('Q Q plot')
    plt.plot(dist_percentiles, emp_percentiles)
    plt.savefig("/path/q-q-plot.png")

b = 2.62
latencies = st.pareto.rvs(b, size=500)
data = pd.Series(latencies)
params = fit_to_all_distributions(data)

pareto_params = params["pareto"]

get_q_q_plot(latencies, "pareto", pareto_params)

Ideally I should get a straight line, but this is what I get.

QQ剧情

Why don't I get a straight line? Is there anything wrong in my implementation?

You can get the QQ plot for any distribution (there are 82 in scipy stats) using the following code.

import os
import matplotlib.pyplot as plt
import sys
import math
import numpy as np
import scipy.stats as st
from scipy.stats._continuous_distns import _distn_names
from scipy.optimize import curve_fit

def get_q_q_plot(latency_values, distribution):

    distribution = getattr(st, distribution)
    params = distribution.fit(latency_values)

    latency_values.sort()

    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    x = []

    for i in range(1, len(latency_values)):
        x.append((i-0.5) / len(latency_values))

    y = distribution.ppf(x, loc=loc, scale=scale, *arg)

    y = list(y)

    emp_percentiles = latency_values[1:]
    dist_percentiles = y

    return emp_percentiles, dist_percentiles

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM