Vectorizing a series of CDF samples in Python with NumPy

I am in the process of writing a basic financial program in Python where daily expenses are read in as a table and turned into a PDF (probability density function) and eventually a CDF (cumulative distribution function) that ranges from 0 to 1, using the built-in histogram capability of NumPy. I am trying to randomly sample a daily expense by comparing a random number between 0 and 1 with the CDF array and an array of the CDF bin center points, using the interp1d functionality of SciPy to determine the interpolated value. I have successfully implemented this algorithm using a for loop, but it is way too slow, and I am trying to convert it to a vectorized format. I am including an example of the code that does work with a for loop and my attempt thus far at vectorizing the algorithm. I would greatly appreciate any advice on how I can make the vectorized version work and increase the execution speed of the code.

Sample input file:

12.00    March 01, 2014
0.00     March 02, 2014
0.00     March 03, 2014
0.00     March 04, 2014
0.00     March 05, 2014
0.00     March 06, 2014
44.50    March 07, 2014
0.00     March 08, 2014
346.55   March 09, 2014
168.18   March 10, 2014
140.82   March 11, 2014
10.83    March 12, 2014
0.00     March 13, 2014
0.00     March 14, 2014
174.00   March 15, 2014
0.00     March 16, 2014
0.00     March 17, 2014
266.53   March 18, 2014
0.00     March 19, 2014
110.00   March 20, 2014
0.00     March 21, 2014
0.00     March 22, 2014
44.50    March 23, 2014

for loop version of the code (works, but is too slow):

#!/usr/bin/python
import pandas as pd
import numpy as np
import random
import itertools
import scipy.interpolate

def Linear_Interpolation(rand,Array,Array_Center):
    if(rand < Array[0]):
        y_interp = scipy.interpolate.interp1d((0,Array[0]),(0,Array_Center[0]))
    else:
        y_interp = scipy.interpolate.interp1d(Array,Array_Center)

    final_value = y_interp(rand)
    return (final_value)

#--------- Main Program --------------------
# - Reads the file in and transforms the first column of float variables into
#   an array titled MISC_DATA
File1 = '../../Input_Files/Histograms/Static/Misc.txt'
MISC_DATA = pd.read_table(File1,header=None,names = ['expense','month','day','year'],sep = r'\s+')

# Creates the PDF bin heights and edges
Misc_hist, Misc_bin_edges = np.histogram(MISC_DATA['expense'],bins=60,density=True)
# Creates the CDF bin heights
Misc = np.cumsum(Misc_hist*np.diff(Misc_bin_edges))
# Creates an array of the bin center points along the x axis
Misc_Center = (Misc_bin_edges[:-1] + Misc_bin_edges[1:])/2

iterator = range(0,100)
for cycle in iterator:
    MISC_EXPENSE = Linear_Interpolation(random.random(),Misc,Misc_Center)
    print(MISC_EXPENSE)

I am trying to vectorize the for loop in the manner shown below and convert the variable MISC_EXPENSE from a scalar into an array, but it is not working. It tells me that the truth value of an array with more than one element is ambiguous. I think it is referring to the fact that the array of random variables 'rand_var' has a different dimension than the arrays 'Misc' and 'Misc_Center'. Any suggestions are appreciated.

rand_var = np.random.rand(100)
MISC_EXPENSE = Linear_Interpolation(rand_var,Misc,Misc_Center)
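
A minimal sketch, outside the original program, of where this message appears to come from: when rand is an array, the comparison rand < Array[0] produces a boolean array, and using a multi-element boolean array as the condition of an if statement raises a ValueError with this wording. The value 0.5 below is a hypothetical stand-in for Misc[0], not a number from the real data.

```python
import numpy as np

rand_var = np.random.rand(100)
cdf_first = 0.5                 # hypothetical stand-in for Misc[0]

# Comparing an array with a scalar gives a boolean array, one entry per draw.
mask = rand_var < cdf_first

message = None
try:
    if mask:                    # same pattern as `if(rand < Array[0])`
        pass
except ValueError as err:
    # NumPy refuses to reduce 100 booleans to a single True/False.
    message = str(err)

print(message)
```

So the mismatch in length between rand_var and Misc is probably not the issue; the scalar-style if test inside Linear_Interpolation is what fails on array input.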

If I understood your example correctly, the code creates one interpolation object per random number, which is slow. However, interp1d can take a vector of values to be interpolated. And I assume the starting zero should be in the CDF in any case:

y_interp = scipy.interpolate.interp1d(
    np.concatenate((np.array([0]), Misc)),
    np.concatenate((np.array([0]), Misc_Center))
)


new_vals = y_interp(np.random.rand(100))
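
To try this without the input file, here is a self-contained sketch that builds the CDF from the expense column of the sample data in the question and draws 100 samples in one call. The short names (expenses, cdf, centers, u), the fixed seed, and the assume_sorted=True flag are choices made for this illustration, not part of the original code. The np.interp line is a possible alternative to interp1d and is included only for comparison.

```python
import numpy as np
from scipy.interpolate import interp1d

# Expense column of the sample input file shown in the question.
expenses = np.array([
    12.00, 0.00, 0.00, 0.00, 0.00, 0.00, 44.50, 0.00, 346.55, 168.18,
    140.82, 10.83, 0.00, 0.00, 174.00, 0.00, 0.00, 266.53, 0.00, 110.00,
    0.00, 0.00, 44.50,
])

# PDF bin heights and edges, then CDF heights and bin centers,
# following the same steps as the question.
hist, edges = np.histogram(expenses, bins=60, density=True)
cdf = np.cumsum(hist * np.diff(edges))
centers = (edges[:-1] + edges[1:]) / 2

# Prepend the (0, 0) point so that draws below cdf[0] are covered.
x = np.concatenate(([0.0], cdf))
y = np.concatenate(([0.0], centers))

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
u = rng.random(100)             # 100 uniform draws in [0, 1)

# One interpolation object, evaluated on the whole array at once.
# The CDF is non-decreasing, so x is already sorted.
samples = interp1d(x, y, assume_sorted=True)(u)

# Possible alternative: np.interp interpolates linearly without building
# an object and clamps out-of-range inputs to the end values.
samples_np = np.interp(u, x, y)
```

Both calls return an array of 100 sampled expenses. Note that empty histogram bins make the CDF flat in places, so several bin centers can share one CDF value; draws that fall strictly between two distinct CDF values are still interpolated unambiguously.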
