In my current project I'm doing data analysis of forest inventory data and fitting statistical distributions to the data by maximum likelihood estimation (MLE).
For each required subset of the data I run the estimation, collect the fitted distribution parameters and the other metrics I need, and store them all in a pandas DataFrame.
So far I do all of this in a big for loop over the subsets, assigning the results to the DataFrame row by row.
What I want to know is: is there a more efficient way of doing this? I also want to avoid keeping lots of copies of the data around, since I often have around a million data points.
I have created a very simplified example with artificial data, and with a simple mean in place of the maximum likelihood estimation, that shows the basic structure:
import numpy as np
import pandas as pd

def Gen_UniformDist(seed=5, size=1000000):
    """Create a set of random numbers uniformly distributed between 0 and 1."""
    np.random.seed(seed)
    return np.random.uniform(size=size)

# Generate some test data
dataSet = Gen_UniformDist()

# Create an array of truncation values
truncValue_arr = np.linspace(0., 0.9, 20)

df_Output = pd.DataFrame(index=truncValue_arr, columns=['mean', 'NumObs'])

for i, truncValue in enumerate(truncValue_arr):
    # Truncate the data using the truncation value
    truncated_DataSet = dataSet[dataSet >= truncValue]

    # In my real code the function here is a more complex maximum
    # likelihood fit rather than the simple mean used here for simplicity
    mean = np.mean(truncated_DataSet)
    numObs = len(truncated_DataSet)

    # Real code would calculate more than 2 values for each row
    df_Output.iloc[i] = [mean, numObs]
What I would like to do is fill the DataFrame efficiently, without the for loop, while also avoiding lots of copies of the data. Is this possible?
There are 2 aspects of your algorithm which can be optimized straight away:

1. Replace the explicit for loop with a list comprehension.
2. Instead of repeated iloc assignments, build a list of tuples and feed it to pd.DataFrame directly.

Here's some pseudo-code:
def return_values(data):
    return np.mean(data), len(data)
L = [return_values(dataSet[dataSet >= truncValue]) for truncValue in truncValue_arr]
df = pd.DataFrame(data=L, index=truncValue_arr, columns=['mean', 'NumObs'])
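Put together, the list-comprehension version can be run end to end. Here is a self-contained sketch (using a smaller sample size than your real million points, and np.mean as a stand-in for the real MLE fit):

```python
import numpy as np
import pandas as pd

def return_values(data):
    # Stand-in for the real maximum-likelihood fit
    return np.mean(data), len(data)

np.random.seed(5)
dataSet = np.random.uniform(size=100_000)  # smaller than the real data for a quick test
truncValue_arr = np.linspace(0.0, 0.9, 20)

# One pass per truncation value; no row-by-row iloc assignment
L = [return_values(dataSet[dataSet >= truncValue]) for truncValue in truncValue_arr]
df = pd.DataFrame(data=L, index=truncValue_arr, columns=['mean', 'NumObs'])

print(df.head(3))
```

Building the whole DataFrame in one constructor call avoids the per-row overhead of iloc assignment, and the list of small tuples costs almost nothing compared to the data itself.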
You can optimize further by factoring out the comparison dataSet >= truncValue, which is recomputed on every loop iteration. Broadcasting lets you compute all the masks at once. Consider the following:
s = np.array([1, 2, 3, 4, 5])
vals = np.array([2, 4])

s[:, None] > vals
array([[False, False],
       [False, False],
       [ True, False],
       [ True, False],
       [ True,  True]])
You can therefore do something like:
mask = dataSet[:, None] >= truncValue_arr

L = [return_values(dataSet[mask[:, i]])
     for i in range(len(truncValue_arr))]
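As a runnable sketch of the broadcast-mask variant (again with a smaller sample and np.mean standing in for the MLE fit):

```python
import numpy as np

def return_values(data):
    # Stand-in for the real maximum-likelihood fit
    return np.mean(data), len(data)

np.random.seed(5)
dataSet = np.random.uniform(size=100_000)
truncValue_arr = np.linspace(0.0, 0.9, 20)

# Broadcast (N, 1) against (20,) to get an (N, 20) boolean mask;
# column i selects the points surviving truncation value i.
mask = dataSet[:, None] >= truncValue_arr

L = [return_values(dataSet[mask[:, i]]) for i in range(len(truncValue_arr))]
```

One trade-off worth noting: the mask is a dense boolean array of shape (N, 20), i.e. about 20 MB for a million points at one byte per element, so you are spending some memory to avoid rescanning the data on every iteration.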