More efficient way than row by row calculation of Pandas Dataframe

In my current project I'm analysing forest inventory data and fitting statistical distributions to it by maximum likelihood estimation.

For each required subset of the data I calculate the estimated distribution parameters and the other metrics I need, and then store them all in a pandas DataFrame.

So far I do all this in a big for loop over each subset of data, assigning the results row by row to the DataFrame.

What I want to know is: is there a more efficient way of doing this? I also don't want lots of copies of the data around, as I often have a million data points or so.

I have created a very simplified example with artificial data and without the maximum likelihood estimation, but it shows the basic structure:

import numpy as np
import pandas as pd

def Gen_UniformDist(seed=5, size=1000000):
    """ Create a set of random numbers uniformly distributed between 0 and 1 """
    np.random.seed(seed)
    return np.random.uniform(size=size)

# Generate some test data
dataSet = Gen_UniformDist()

# Create an array of truncation values
truncValue_arr = np.linspace(0., 0.9, 20)

df_Output = pd.DataFrame(index=truncValue_arr, columns=['mean', 'NumObs'])

for i, truncValue in enumerate(truncValue_arr):
    # Truncate the data using the truncation value
    truncated_DataSet = dataSet[dataSet >= truncValue]

    # In my real code the function here is a more complex maximum
    # likelihood fit rather than the simple mean used for simplicity here
    mean = np.mean(truncated_DataSet)

    numObs = len(truncated_DataSet)

    # Real code would calculate more than 2 values for each row
    df_Output.iloc[i] = [mean, numObs]

What I would like to do is fill the DataFrame efficiently without the for loop, while also avoiding lots of copies of the data. Is this possible?

There are two aspects of your algorithm that can be optimized straight away:

  1. Replace the for loop with a list comprehension.
  2. Instead of repeated iloc calls, build a list of tuples and feed it to pd.DataFrame directly.

Here's some pseudo-code:

def return_values(data):
    # len(data) works for both NumPy arrays and pandas objects
    return np.mean(data), len(data)

L = [return_values(dataSet[dataSet >= truncValue]) for truncValue in truncValue_arr]

df = pd.DataFrame(data=L, index=truncValue_arr, columns=['mean', 'NumObs'])
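
If you want to measure the difference yourself, here is a minimal timing sketch (an illustration only, assuming dataSet, truncValue_arr, and return_values from the snippets above are in scope; with only 20 truncation values the per-iteration mean dominates, so expect a modest gain that grows with the number of rows):

import timeit

import numpy as np
import pandas as pd

def loop_version():
    # Row-by-row assignment into a pre-allocated frame, as in the question
    df = pd.DataFrame(index=truncValue_arr, columns=['mean', 'NumObs'])
    for i, truncValue in enumerate(truncValue_arr):
        truncated = dataSet[dataSet >= truncValue]
        df.iloc[i] = [np.mean(truncated), len(truncated)]
    return df

def listcomp_version():
    # Build all rows first, then construct the frame in one call
    L = [return_values(dataSet[dataSet >= truncValue]) for truncValue in truncValue_arr]
    return pd.DataFrame(data=L, index=truncValue_arr, columns=['mean', 'NumObs'])

print(timeit.timeit(loop_version, number=10))
print(timeit.timeit(listcomp_version, number=10))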

Building the frame in one call also lets pandas infer proper numeric dtypes, whereas row-by-row assignment into a pre-allocated empty frame leaves object-dtype columns. You can optimize further by hoisting dataSet >= truncValue, which is recomputed on every iteration, out of the loop with a single broadcasted comparison. Consider the following:

s = pd.Series([1, 2, 3, 4, 5])
vals = np.array([2, 4])

s.to_numpy()[:, None] > vals

array([[False, False],
       [False, False],
       [ True, False],
       [ True, False],
       [ True,  True]], dtype=bool)

You can therefore do something like:

mask = dataSet[:, None] >= truncValue_arr

L = [return_values(dataSet[mask[:, i]])
     for i in range(len(truncValue_arr))]
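
As a quick sanity check (assuming L from the snippet above and df_Output from the question's loop are both in scope), the broadcasted version should reproduce the original results exactly; the question's pre-allocated frame has object-dtype columns, so cast both to float before comparing:

import pandas as pd

df_mask = pd.DataFrame(data=L, index=truncValue_arr, columns=['mean', 'NumObs'])

# Identical selections feed identical means, so the values match exactly
pd.testing.assert_frame_equal(df_mask.astype(float), df_Output.astype(float))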
