Curve fitting with cubic spline

Question

I am trying to interpolate a cumulative distribution of, e.g., (i) the number of people against (ii) the number of cars they own, showing that, e.g., the top 20% of people own much more than 20% of all cars. Of course, 100% of people own 100% of cars. I also know that there are, e.g., 100mn people and 200mn cars.

Now coming to my code:

#import libraries (more than required here)
import pandas as pd
from scipy import interpolate
from scipy.interpolate import interp1d
from sympy import symbols, solve, Eq
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px

curve=pd.read_excel('inputs.xlsx',sheet_name='inputdata')

Input data: Curveplot (cumulated people (x) on the left // cumulated cars (y) on the right)

#Input data in list form (I am not sure how to interpolate from a list for the moment)
cumulatedpeople = [0, 0.453086, 0.772334, 0.950475, 0.978981, 0.999876, 0.999990, 1]
cumulatedcars= [0, 0.016356, 0.126713, 0.410482, 0.554976, 0.950073, 0.984913, 1]

points = curve.to_numpy()  # assumed: column 0 is cumulated people, column 1 is cumulated cars
x, y = points[:,0], points[:,1]
interpolation = interp1d(x, y, kind = 'cubic')

number_of_people_mn= 100000000

oneperson = 1 / number_of_people_mn
dataset = pd.DataFrame(range(number_of_people_mn + 1))
dataset.columns = ["nr_of_one_person"]
dataset.drop(dataset.index[:1], inplace=True)

#calculating the position of every single person on the cumulated x-axis (between 0 and 1)
dataset["cumulatedpeople"] = dataset["nr_of_one_person"] / number_of_people_mn

#finding the "cumulatedcars" to the "cumulatedpeople" via interpolation (between 0 and 1)
dataset["cumulatedcars"] = interpolation(dataset["cumulatedpeople"])

plt.plot(dataset["cumulatedpeople"], dataset["cumulatedcars"])
plt.legend(['Cubic interpolation'], loc = 'best')
plt.xlabel('Cumulated people')
plt.ylabel('Cumulated cars')
plt.title("People-to-car cumulated curve")
plt.show()

However, when I look at the actual plot, I get the following result, which is wrong: (figure: Cubic interpolation)

In fact, the curve should look almost like the one from a linear interpolation of the exact same input data, although linear interpolation is not accurate enough for my purpose: (figure: Linear interpolation)

Am I missing a relevant step, or what would be the best way to get an accurate interpolation from these inputs that looks almost like the linear one?

Answer 1

Short answer: your code is doing the right thing, but the data is unsuitable for cubic interpolation.

Let me explain. Here is your code, which I simplified for clarity:

import numpy as np
from scipy.interpolate import interp1d
from matplotlib import pyplot as plt

cumulatedpeople = [0, 0.453086, 0.772334, 0.950475, 0.978981, 0.999876, 0.999990, 1]
cumulatedcars= [0, 0.016356, 0.126713, 0.410482, 0.554976, 0.950073, 0.984913, 1]
interpolation = interp1d(cumulatedpeople, cumulatedcars, kind = 'cubic')

number_of_people_mn = 100  # reduced from 100000000 to keep the example fast
cumppl = np.arange(number_of_people_mn + 1)/number_of_people_mn
cumcars = interpolation(cumppl)
plt.plot(cumppl, cumcars)
plt.plot(cumulatedpeople, cumulatedcars,'o')
plt.show()

Note the last couple of lines: I am plotting, on the same graph, both the interpolated results and the input data. Here is the result:

The orange dots are the original data, and the blue line is the cubic interpolation. The interpolator passes through all the points, so technically it is doing the right thing.

Clearly, it is not doing what you want.

The reason for this strange behavior is mostly at the right end, where a few x-points are very close together: the interpolator produces massive wiggles while trying to fit very closely spaced points.
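To put numbers on the spacing problem, here is a minimal sketch using the data from the question. The names `gaps`, `secant_slopes`, `grid` and `cubic_is_monotone` are mine, and the 1001-point grid is an arbitrary choice. It prints the gaps between consecutive x-points and the slope of each straight segment, then checks whether the cubic interpolant keeps increasing between the data points:

```python
import numpy as np
from scipy.interpolate import interp1d

cumulatedpeople = [0, 0.453086, 0.772334, 0.950475, 0.978981, 0.999876, 0.999990, 1]
cumulatedcars = [0, 0.016356, 0.126713, 0.410482, 0.554976, 0.950073, 0.984913, 1]

x = np.asarray(cumulatedpeople)
y = np.asarray(cumulatedcars)

# Distance between neighbouring x-points and slope of each linear segment
gaps = np.diff(x)
secant_slopes = np.diff(y) / gaps
print("gaps between x-points:", gaps)
print("slope of each linear segment:", secant_slopes)

# Evaluate the cubic interpolant on a fine grid (grid size chosen arbitrarily)
cubic = interp1d(x, y, kind='cubic')
grid = np.linspace(0, 1, 1001)
cubic_vals = cubic(grid)

# A cumulative curve must never decrease; check whether the interpolant obeys that
cubic_is_monotone = bool(np.all(np.diff(cubic_vals) >= 0))
print("cubic interpolant is non-decreasing on the grid:", cubic_is_monotone)
```

The gaps range from about 0.45 down to 0.00001, and the segment slopes climb from below 0.1 to roughly 1500. A single smooth cubic spline cannot follow that without overshooting.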

If I remove two right-most points from the interpolator:

interpolation = interp1d(cumulatedpeople[:-2], cumulatedcars[:-2], kind = 'cubic')

it looks a bit more reasonable:

But one could still argue that linear interpolation is better. The wiggles are now at the left end, because the gaps between the initial x-points are too large.

The moral here is that cubic interpolation should really be used only if the gaps between x-points are roughly the same.

Your best bet here, I think, is to use something like curve_fit.

A related discussion can be found here.

Specifically, monotone interpolation, as explained here, yields good results on your data. Copying the relevant bits, you would replace the interpolator with

from scipy.interpolate import pchip
interpolation = pchip(cumulatedpeople, cumulatedcars)

and get a decent-looking fit:
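As a quick sanity check on the monotone fit, here is a minimal sketch, assuming the same data as in the question. It uses `PchipInterpolator`, the class that `scipy.interpolate.pchip` refers to. The names `pchip_fit`, `grid`, `fitted` and `steps` are mine, and the grid size is arbitrary. It verifies that the fitted curve never decreases and stays between 0 and 1:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

cumulatedpeople = [0, 0.453086, 0.772334, 0.950475, 0.978981, 0.999876, 0.999990, 1]
cumulatedcars = [0, 0.016356, 0.126713, 0.410482, 0.554976, 0.950073, 0.984913, 1]

x = np.asarray(cumulatedpeople)
y = np.asarray(cumulatedcars)

# Monotone piecewise-cubic interpolant through the input points
pchip_fit = PchipInterpolator(x, y)

# Evaluate on a fine grid (grid size chosen arbitrarily)
grid = np.linspace(0, 1, 1001)
fitted = pchip_fit(grid)

# Differences between neighbouring grid values; none should be negative
steps = np.diff(fitted)
print("smallest step between grid values:", steps.min())
print("range of fitted values:", fitted.min(), fitted.max())
```

Since the input data is increasing, the PCHIP curve should be too, so a smallest step of zero or above and a range within [0, 1] is what to expect here.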
