
entropy python implementation

I'm trying to rewrite this MATLAB/Octave repo in Python. I've come across what seems to be an implementation of an entropy function (see below). After some research, I found that I can use scipy's entropy implementation in Python. But after reading more about scipy's entropy formula (e.g. S = -sum(pk * log(pk), axis=0)), I doubt that the two compute the same thing...

Could anyone confirm my suspicion, please?
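For reference, this is roughly what scipy computes on a discrete probability vector (the numbers below are made up just to illustrate the formula quoted above); the MATLAB function I'm trying to port follows after:

import numpy as np
from scipy.stats import entropy

pk = np.array([0.5, 0.25, 0.125, 0.125])  # a made-up discrete distribution
print(entropy(pk))                         # -sum(pk * log(pk)), in nats
print(entropy(pk, base=2))                 # the same, in bits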

% author: YangSong 2010.11.16 C230
%file:ys_sampEntropy.m
% code is called from line 101 of algotrading.m
%  =>   entropy180(i)=ys_sampEntropy(kmeans180s1(i,1:180));
% where kmeans180s1 is an array of size 100x181 containing the kmeans  
% centroids and the price label at position 181.

function sampEntropy=ys_sampEntropy(xdata)
m=2;
n=length(xdata);
r=0.2*std(xdata);  % tolerance r: 20% of the standard deviation of the data
%r=0.05;
cr=[];
gn=1;
gnmax=m;
while gn<=gnmax
      d=zeros(n-m+1,n-m);  % distances between template vectors
      x2m=zeros(n-m+1,m);  % matrix of length-m template vectors
      cr1=zeros(1,n-m+1);  % match counts per template
      k=1;

      for i=1:n-m+1

          for j=1:m
              x2m(i,j)=xdata(i+j-1);
          end

      end
      x2m;

      for i=1:n-m+1

          for j=1:n-m+1

              if i~=j
                 d(i,k)=max(abs(x2m(i,:)-x2m(j,:)));  % Chebyshev (max-norm) distance between templates i and j
                 k=k+1;
              end

          end

          k=1;
      end
      d;

      for i=1:n-m+1
          [k,l]=size(find(d(i,:)<r));  % l = number of distances smaller than r
          cr1(1,i)=l;
      end
      cr1;

      cr1=(1/(n-m))*cr1;
      sum1=0;

      for i=1:n-m+1

          if cr1(i)~=0
             %sum1=sum1+log(cr1(i));
             sum1=sum1+cr1(i);
          end  % if

      end  % for

      cr1=1/(n-m+1)*sum1;
      cr(1,gn)=cr1;
      gn=gn+1;
      m=m+1;
end  % while
cr;

sampEntropy=log(cr(1,1))-log(cr(1,2));

The code is pretty unreadable, but it is still clear that this is not an implementation of the Shannon entropy calculation for discrete variables as implemented in scipy. Instead, it vaguely looks like the Kozachenko-Leonenko k-nearest-neighbour estimator used to estimate the entropy of continuous variables (Kozachenko & Leonenko 1987).

The basic idea of that estimator is to look at the average distance between neighbouring data points. The intuition is that if that distance is large, the dispersion in your data is large and hence the entropy is large. In practice, instead of taking the nearest-neighbour distance, one tends to take the k-nearest-neighbour distance, which tends to make the estimate more robust.
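For illustration, a minimal sketch of such a k-nearest-neighbour entropy estimate (in nats, Euclidean norm, using the usual digamma form; knn_entropy is just an illustrative name, not the code from the repo) might look like this:

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def knn_entropy(x, k=1):
    # x: (n_samples, n_dims) array of continuous data; duplicate points should be jittered
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n, d = x.shape
    # distance from each point to its k-th nearest neighbour
    # (query with k + 1 because the closest hit is the point itself)
    eps = cKDTree(x).query(x, k=k + 1)[0][:, k]
    # log-volume of the d-dimensional unit ball under the Euclidean norm
    log_vd = 0.5 * d * np.log(np.pi) - gammaln(0.5 * d + 1)
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(eps))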

The code shows some distance calculations:

d(i,k)=max(abs(x2m(i,:)-x2m(j,:)));

and there is some counting of points that are nearer than some fixed distance:

[k,l]=size(find(d(i,:)<r));

However, it is also clear that this is not exactly the Kozachenko-Leonenko estimator but some butchered version of it.

If you do end up wanting to compute the Kozachenko-Leonenko estimator, I have some code to that effect on my github:

https://github.com/paulbrodersen/entropy_estimators

EDIT:

After looking some more at this mess, I am no longer sure that he/she is not in fact using (or attempting to use) the classical Shannon information definition for discrete variables, even though the input is continuous:

  for i=1:n-m+1
      [k,l]=size(find(d(i,:)<r));  % l = number of distances smaller than r
      cr1(1,i)=l;
  end
  cr1;

  cr1=(1/(n-m))*cr1;

The for loop counts the number of data points closer than r, and then the last line in the snippet divides that number by some interval to get a density.

Those densities are then summed below:

  for i=1:n-m+1

      if cr1(i)~=0
         %sum1=sum1+log(cr1(i));
         sum1=sum1+cr1(i);
      end  % if

  end  % for

But then we get these bits (again!):

  cr1=1/(n-m+1)*sum1;
  cr(1,gn)=cr1;

And

sampEntropy=log(cr(1,1))-log(cr(1,2));

My brain refuses to believe that the returned value could be your average log(p), but I am no longer 100% sure.

Either way, if you want to compute the entropy of a continuous variable, you should either fit a distribution to your data or use the Kozachenko-Leonenko estimator. And please write better code.
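For example, the "fit a distribution" route could be as simple as the following sketch (assuming the data is reasonably well described by a normal distribution; scipy's frozen distributions expose a closed-form entropy() in nats):

import numpy as np
from scipy import stats

data = np.random.randn(1000)             # stand-in for your continuous data
loc, scale = stats.norm.fit(data)        # fit the parametric model
print(stats.norm(loc, scale).entropy())  # differential entropy of the fitted normal, in nats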

import numpy as np


# Entropy
def entropy(Y):
    """
    Also known as Shannon entropy
    Reference: https://en.wikipedia.org/wiki/Entropy_(information_theory)
    """
    unique, count = np.unique(Y, return_counts=True, axis=0)
    prob = count/len(Y)
    en = np.sum((-1)*prob*np.log2(prob))
    return en


#Joint Entropy
def jEntropy(Y,X):
    """
    H(Y,X)
    Reference: https://en.wikipedia.org/wiki/Joint_entropy
    """
    YX = np.c_[Y,X]
    return entropy(YX)

#Conditional Entropy
def cEntropy(Y, X):
    """
    conditional entropy = Joint Entropy - Entropy of X
    H(Y|X) = H(Y,X) - H(X)
    Reference: https://en.wikipedia.org/wiki/Conditional_entropy
    """
    return jEntropy(Y, X) - entropy(X)


#Information Gain
def gain(Y, X):
    """
    Information Gain, I(Y;X) = H(Y) - H(Y|X)
    Reference: https://en.wikipedia.org/wiki/Information_gain_in_decision_trees#Formal_definition
    """
    return entropy(Y) - cEntropy(Y,X)
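As a quick sanity check, assuming numpy is imported as np and Y, X are arrays of discrete labels (the values below are made up), the functions can be used like this:

Y = np.array([0, 0, 1, 1, 1, 0])
X = np.array([0, 1, 1, 1, 0, 0])

print(entropy(Y))      # H(Y) in bits
print(jEntropy(Y, X))  # H(Y,X)
print(cEntropy(Y, X))  # H(Y|X)
print(gain(Y, X))      # I(Y;X) = H(Y) - H(Y|X)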

This is what I use:

import numpy as np

def entropy(data, bins=None):
    if bins is None:
        bins = len(np.unique(data))
    cx = np.histogram(data, bins)[0]
    normalized = cx / float(np.sum(cx))
    normalized = normalized[np.nonzero(normalized)]
    h = -np.sum(normalized * np.log2(normalized))
    return h
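For example, on a continuous sample with an explicit bin count (the data below is made up; the result depends on the choice of bins):

x = np.random.randn(1000)
print(entropy(x, bins=30))  # histogram-based Shannon entropy in bits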





"""
Approximate entropy : used to quantify the amount of regularity and the unpredictability of fluctuations 
over time-series data.

The presence of repetitive patterns of fluctuation in a time series renders it more predictable than a 
time series in which such patterns are absent. 
ApEn reflects the likelihood that similar patterns of observations will not be followed by additional 
similar observations.
[7] A time series containing many repetitive patterns has a relatively small ApEn, 
a less predictable process has a higher ApEn.

    U: time series
    The value of "m" represents the (window) length of compared run of data, and "r" specifies a filtering level.

    https://en.wikipedia.org/wiki/Approximate_entropy

Good m,r values :
   m = 2,3 : rolling window
   r = 10% - 25% of seq std-dev

"""

import numpy as np

def approx_entropy(U, m, r):

    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        x = [[U[j] for j in range(i, i + m - 1 + 1)] for i in range(N - m + 1)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1.0)**(-1) * sum(np.log(C))

    N = len(U)

    return abs(_phi(m + 1) - _phi(m))
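Following the parameter guidance in the docstring (m = 2 and r at 20% of the standard deviation; the series below is made up):

U = np.random.randn(300)  # stand-in time series
print(approx_entropy(U, m=2, r=0.2 * np.std(U)))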
