
Numpy transformation to normal distribution

I have an array of data. I checked if it was normally distributed:

import sys
from scipy import stats

# Read one floating-point value per line from the file given on the command line.
Data = []
for line in open(sys.argv[1]):
    line = line.strip()
    Data.append(float(line))
print(stats.normaltest(Data))

The output was: (36.444648754208075, 1.2193968690198398e-08)

Then, I wrote a small script to normalise the data:

import sys
import numpy as np
fileopen = open(sys.argv[1])
UntransformedArray = []
for line in fileopen:
    line = float(line.strip())
    UntransformedArray.append(line)
TransformedArray = (UntransformedArray - np.mean(UntransformedArray)/np.std(UntransformedArray))
NewList = TransformedArray.tolist()
for i in NewList:
    print(i)

And then I checked for normality again using the first script and the output was (36.444648754209595, 1.2193968690189117e-08).

...the same as the previous score, and not normally distributed.

Is one of my scripts wrong?

I should also mention that the mean of my data is 0.056 and the values range from 0.014 to 0.171 (85 observations). I'm not sure whether the fact that the numbers are so small matters.

A sample of the untransformed and transformed data:

Untransformed:

0.055
0.074
0.049
0.067
0.038
0.037
0.045
0.041

Transformed data:

-2.13696814254
-2.11796814254
-2.14296814254
-2.12496814254
-2.15396814254
-2.15496814254
-2.14696814254

Edit 1:

When I edit the code slightly to account for the parentheses being in the wrong place:

TransformedMean = (UntransformedArray - np.mean(UntransformedArray))
TransformedArray = (TransformedMean/np.std(UntransformedArray))
NewList = TransformedArray.tolist()
for i in NewList:
    print(i)

The output I get is different:

Example:

-0.0385683544143
0.705333390576
-0.273484694937
0.431264326632
-0.704164652563
-0.743317375984

However, when I check for normality: (36.444648754241328, 1.2193968689995659e-08)

It is still not normally distributed (and is still the exact same score as the other times)?

Edit 2:

I then tried a different method of normalising the data:

import sys
import scipy
from scipy import stats
from scipy.stats import boxcox

Data = [(float(line.strip())) for line in open(sys.argv[1])]
scipy.stats.boxcox(Data)

I get the error: TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'float'

Edit 3: Following a comment from a user, the problem turned out to be that I had not understood the difference between normalising values and normalising a distribution.

Edited code:

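import sys
import numpy as np

# Read one value per line.
fileopen = open(sys.argv[1])
UntransformedArray = []
for line in fileopen:
    line = float(line.strip())
    UntransformedArray.append(line)

# A log transform changes the shape of the distribution (values must be positive).
List1 = np.log(UntransformedArray)
for i in List1:
    print(i)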

Checking for normality: (4.0435072214905938, 0.13242304287973003)

(works in this case, depending on skewness of the data).

Edit 4: Or using a BoxCox transformation:

import sys
import numpy as np
from scipy import stats

Data = []
for line in open(sys.argv[1]):
    line = line.strip()
    Data.append(float(line))

# Passing a NumPy array instead of a list avoids the TypeError from Edit 2.
# boxcox returns a tuple: (transformed data, fitted lambda).
data = stats.boxcox(np.array(Data))
for i in data[0]:
    print(i)

Checking for normality: (2.9085877478631956, 0.23356523218452238)

As expected, subtracting the mean and rescaling to unit variance does not change the shape of the distribution. normaltest correctly returns the same output in both cases, telling you that your data is not normally distributed.
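To make this concrete, here is a minimal sketch showing that standardizing leaves the normaltest result unchanged. The sample is hypothetical (85 values built from lognormal quantiles, standing in for the asker's file), and the variable names are only illustrative.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed sample: 85 lognormal quantiles (not the asker's data).
probs = (np.arange(1, 86) - 0.5) / 85
data = np.exp(0.5 * stats.norm.ppf(probs) - 3.0)

# Standardize: subtract the mean, then divide by the standard deviation.
standardized = (data - data.mean()) / data.std()

stat_raw, p_raw = stats.normaltest(data)
stat_std, p_std = stats.normaltest(standardized)

# Both lines should show the same statistic and p-value (up to rounding).
print(stat_raw, p_raw)
print(stat_std, p_std)
```

The test is built from skewness and kurtosis, and neither changes when the data are shifted or rescaled, which is why the score in the question never moved.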

I agree with Thomas, but to be more precise: you are standardizing your array, and this does not change the shape of its distribution. You might want to use the numpy.histogram() function to get an impression of the distributions.
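As a rough illustration of the numpy.histogram() suggestion, the sketch below bins a hypothetical skewed sample before and after standardizing. The data and the choice of 10 bins are assumptions made for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed sample: 85 lognormal quantiles (not the asker's data).
probs = (np.arange(1, 86) - 0.5) / 85
data = np.exp(0.5 * stats.norm.ppf(probs) - 3.0)
standardized = (data - data.mean()) / data.std()

# np.histogram returns the count per bin and the bin edges.
counts_raw, edges_raw = np.histogram(data, bins=10)
counts_std, edges_std = np.histogram(standardized, bins=10)

# Crude text histogram: one row per bin, one '#' per observation.
for count in counts_raw:
    print("#" * int(count))

print(counts_raw)
print(counts_std)
```

Only the bin edges differ between the two calls; the counts per bin, and therefore the shape, should be the same.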

I think you have fallen prey to the confusing double usage of 'normalization'. On the one hand, normalization is used to describe standardization of variables (getting variables on the same scale, which is what you did). On the other hand, normalization is used to describe attempts to change the shape of a probability distribution (scipy.stats.normaltest() checks whether a sample's distribution has the normal shape). One easy strategy for making a distribution more normal is a log transformation. numpy.log() might do the trick here, but only if the original distribution is not too skewed.
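A small sketch of the log-transform idea, again on a hypothetical sample (lognormal quantiles, chosen so that the log is symmetric by construction). Real data will rarely behave this cleanly, so treat the result as an illustration rather than a guarantee.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed, strictly positive sample (not the asker's data).
probs = (np.arange(1, 86) - 0.5) / 85
data = np.exp(0.5 * stats.norm.ppf(probs) - 3.0)

# np.log requires strictly positive values.
logged = np.log(data)

skew_before = stats.skew(data)
skew_after = stats.skew(logged)
stat_before, p_before = stats.normaltest(data)
stat_after, p_after = stats.normaltest(logged)

print("skewness before/after:", skew_before, skew_after)
print("p-value before/after:", p_before, p_after)
```

Unlike standardizing, the log is a non-linear transform, so it changes the skewness and hence the normaltest result.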

I came across the same problem. My data was not normal either, and I had to transform it to a normal distribution. To do this you can use a normal score transform, which can be computed by several methods, as described here. You can also use these formulas. I have written some Python code that converts your list of elements to a normal distribution:

from scipy.stats import rankdata, norm

x = [0.055, 0.074, 0.049, 0.067, 0.038, 0.037, 0.045, 0.041]

# Convert ranks to probabilities in (0, 1), then to standard normal quantiles.
newX = norm.ppf(rankdata(x)/(len(x) + 1))
print(newX)

output:
[ 0.4307273   1.22064035  0.1397103   0.76470967 -0.76470967 -1.22064035
-0.1397103  -0.4307273 ]

After this transformation the new data follow the normal quantiles, as you can see from a Q-Q plot:

from scipy import stats
import matplotlib.pyplot as plt

ax4 = plt.subplot(111)
res = stats.probplot(newX, plot=plt)
plt.show()
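For reuse, the same rank-based transform can be wrapped in a function and checked with normaltest. This is only a sketch: normal_scores is a hypothetical helper name, and the 85-value sample is made up for illustration.

```python
import numpy as np
from scipy.stats import rankdata, norm, normaltest

def normal_scores(values):
    """Rank-based normal score transform (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    ranks = rankdata(values)  # tied values receive their average rank
    # Map ranks to probabilities strictly inside (0, 1), then to normal quantiles.
    return norm.ppf(ranks / (len(values) + 1))

# Hypothetical right-skewed sample of 85 values (not the asker's data).
probs = (np.arange(1, 86) - 0.5) / 85
sample = np.exp(0.5 * norm.ppf(probs) - 3.0)
scores = normal_scores(sample)

stat_before, p_before = normaltest(sample)
stat_after, p_after = normaltest(scores)
print(p_before, p_after)
```

Note that this transform keeps only the ordering of the values, so the original spacing between observations is discarded.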
