
Estimate joint distribution in Python and sample given response variable

Original post: 2013-11-13 19:19:22 · Tags: python, numpy, distribution, probability

I have a sequence of samples from a function Y = f(X), where X consists of d random variables X_1, X_2, ..., X_d and Y is a real-valued response variable. Each sample is a setting x_1, x_2, ..., x_d for X together with the observed value y for Y. I store the samples in a matrix of dimension (n x d) and the responses in a vector of dimension (n x 1).

I want to calculate the joint distribution in Python in such a way that, upon receiving new samples, I can update the distribution painlessly.

Most importantly, I want to be able to sample X settings from my own calculated distribution conditioned on Y. That is, pick a desired value Y = y and draw, from the conditioned, weighted joint distribution, a likely set of settings for X given that choice of Y = y.

Some variables are categorical and some ordinal, but I am fine with discretizing them to integers (e.g., X_i in {'red', 'blue', 'green'} => {1, 2, 3}) if needed.

Doing this is easy enough for small d, but it gets more difficult as d grows. What solutions or frameworks, if any, exist for this workflow in Python? Maybe rolling my own with numpy isn't so bad? Is there example code? I know very little about statistics, but I'm quite solid at Python.

1 answer

In general, this kind of problem belongs to the field of machine learning; a nice Python package you might want to check out is scikit-learn. But if all you need is that sampling, simpler structures will do:

Keep a list of all (x, y) pairs, ordered by y. rbtree is quite handy here.
When you need a sample of (x, y) pairs at some value y = y0, search that list for y0 and return a selection of pairs from around that position.
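
The two steps above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: it swaps rbtree for a plain sorted list maintained with the standard-library bisect module, and the class name ConditionalSampler, the window parameter, and the uniform choice among neighbours are all assumptions made for the example.

```python
import bisect
import random


class ConditionalSampler:
    """Hypothetical helper: keeps (x, y) pairs ordered by y and samples x near a target y."""

    def __init__(self):
        self._ys = []  # responses, kept sorted
        self._xs = []  # settings, kept aligned with self._ys

    def add(self, x, y):
        # Insert the new pair so that the list stays ordered by y.
        # With a plain list this costs O(n) per insert; a balanced tree
        # (such as the rbtree mentioned above) would make it O(log n).
        i = bisect.bisect_left(self._ys, y)
        self._ys.insert(i, y)
        self._xs.insert(i, tuple(x))

    def sample(self, y0, window=5, k=1, rng=random):
        """Return k (x, y) pairs drawn uniformly, with replacement,
        from up to `window` stored pairs on each side of y0."""
        if not self._ys:
            raise ValueError("no samples stored yet")
        if window < 1:
            raise ValueError("window must be at least 1")
        # Position where y0 would be inserted in the sorted responses.
        i = bisect.bisect_left(self._ys, y0)
        lo = max(0, i - window)
        hi = min(len(self._ys), i + window)
        candidates = list(zip(self._xs[lo:hi], self._ys[lo:hi]))
        return [rng.choice(candidates) for _ in range(k)]


if __name__ == "__main__":
    sampler = ConditionalSampler()
    # Toy data: two integer-coded settings per sample, response y = i.
    for i in range(100):
        sampler.add((i, i % 3), float(i))
    for x, y in sampler.sample(50.0, window=3, k=5):
        print(x, y)
```

New samples are absorbed by calling add again, so updating stays cheap. Note that this returns stored settings whose response happened to be close to y0; it does not estimate a smooth conditional density, and the window size controls how loosely "close" is interpreted.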