
Measure negative/positive skewness of a dataframe

I am looking for a way to check whether data is skewed left or right using Spark. The following example returns the same skewness statistic for both DataFrames, even though the non-zero values sit at opposite ends of the date range.

>>> from pyspark.sql import functions as f
>>> val1 = [('2018-01-01',20),('2018-02-01',100),('2018-03-01',50),('2018-04-01',0),('2018-05-01',0),('2018-06-01',0),('2018-07-01',0),('2018-08-01',0),('2018-09-01',0)]
>>> val2 = [('2018-01-01',0),('2018-02-01',0),('2018-03-01',0),('2018-04-01',0),('2018-05-01',0),('2018-06-01',0),('2018-07-01',20),('2018-08-01',100),('2018-09-01',50)]
>>> columns = ['date','value']
>>> val1_df = spark.createDataFrame(val1, columns)
>>> val1_df.agg(f.skewness("value")).show()
+-----------------+
|  skewness(value)|
+-----------------+
|1.646145420937772|
+-----------------+

>>> val2_df = spark.createDataFrame(val2, columns)
>>> val2_df.agg(f.skewness("value")).show()
+------------------+
|   skewness(value)|
+------------------+
|1.6461454209377715|
+------------------+

Is there any way to get positive or negative skewness with respect to the "date" column in Spark?

Both of those vectors contain the same values, only in a different order, so they have the same distribution and the skewness will be the same.

from scipy.stats import skew

val1 = [20,100,50,0,0,0,0,0,0]
skew(val1)

Out[6]: 1.646145420937772

val2 = [0,0,0,0,0,0,50,100,20]
skew(val2)

Out[7]: 1.646145420937772

If you replace the zeroes in the second vector with 100, the distribution skews to the left and the skewness becomes negative.

val2 = [100,100,100,100,100,100,50,100,20]
skew(val2)

Out[9]: -1.5578824286327273
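To make the point above concrete, here is a minimal sketch of the population (biased) skewness written out by hand, on the assumption that this is the definition behind both the default `scipy.stats.skew` and Spark's `skewness`. The function name `population_skewness` is hypothetical and used only for illustration. Only the deviations from the mean enter the formula, so reordering the values cannot change the result.

```python
import math

def population_skewness(values):
    """Biased sample skewness g1 = m3 / m2**1.5 (illustrative helper)."""
    n = len(values)
    mean = sum(values) / n
    # Second and third central moments; element order plays no role here.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / math.pow(m2, 1.5)

val1 = [20, 100, 50, 0, 0, 0, 0, 0, 0]
val2 = [0, 0, 0, 0, 0, 0, 50, 100, 20]                  # same values, different order
val3 = [100, 100, 100, 100, 100, 100, 50, 100, 20]      # zeroes replaced by 100

print(population_skewness(val1))   # positive: long tail toward large values
print(population_skewness(val2))   # same as val1 up to floating-point rounding
print(population_skewness(val3))   # negative: long tail toward small values
```

The sign of the result describes the shape of the value distribution itself, not where in time the large values occur.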

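In PySpark (`display` is the Databricks notebook helper; outside Databricks, call `.show()` on the result instead):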

from pyspark.sql import functions as f

val1 = [(20,100),(100,100),(50,100),(0,100),(0,100),(0,0),(0,50),(0,100),(0,20)]
cols = ['val1','val2']
df = spark.createDataFrame(val1, cols)
display(df.select(f.skewness(df['val1']),f.skewness(df['val2'])))

skewness(val1)  | skewness(val2)
1.6461454209377713 |-0.9860224906700872

Skewness is a statistical moment: a quantitative way to identify whether a distribution is skewed positively or negatively, and by how much. It is a univariate measure. Multivariate skewness and kurtosis exist, but they are more complicated.

What you are asking for is a qualitative analysis of the distribution with respect to the dates. For a multivariate analysis, you could use the Chi-square test or Royston's H test. Or you can simply bucket the values by date and inspect them visually.

If you want an analytical result, you could bucket the values by date, sort the buckets by value in descending order, and find which dates hold the top 3 or 4 bucketed values. You can then find out which quarter those dates fall in by defining a QTR lookup table and comparing against it. This gives you an idea of whether the large values sit toward the beginning or the end of the year. If the top dates are all over the place, then the distribution is most likely independent of the dates.
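The bucketing idea above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `quarter_of` and `top_quarters` are hypothetical, the quarter is derived arithmetically from the month rather than from a lookup table, and the rows are the `(date, value)` tuples from the question. The same steps could be expressed in Spark with a group-by, a sort and a limit.

```python
from collections import defaultdict
from datetime import date

def quarter_of(day):
    """Map a date to its calendar quarter (1 to 4)."""
    return (day.month - 1) // 3 + 1

def top_quarters(rows, k=3):
    """Return the quarters of the k dates with the largest summed values."""
    totals = defaultdict(int)
    for day_str, value in rows:
        # Bucket by date; repeated dates are summed into one bucket.
        totals[date.fromisoformat(day_str)] += value
    # Sort buckets by value, largest first, and keep the top k.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [quarter_of(day) for day, _ in ranked[:k]]

val1 = [('2018-01-01', 20), ('2018-02-01', 100), ('2018-03-01', 50),
        ('2018-04-01', 0), ('2018-05-01', 0), ('2018-06-01', 0),
        ('2018-07-01', 0), ('2018-08-01', 0), ('2018-09-01', 0)]
val2 = [('2018-01-01', 0), ('2018-02-01', 0), ('2018-03-01', 0),
        ('2018-04-01', 0), ('2018-05-01', 0), ('2018-06-01', 0),
        ('2018-07-01', 20), ('2018-08-01', 100), ('2018-09-01', 50)]

print(top_quarters(val1))  # [1, 1, 1]: the large values fall early in the range
print(top_quarters(val2))  # [3, 3, 3]: the large values fall late in the range
```

Unlike the skewness statistic, this result does depend on where in time the large values occur, which is what distinguishes the two example datasets.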

Calculate the mean and the median.

As a rule of thumb: when the mean is bigger than the median, the distribution is positively (right) skewed; when the mean, median and mode are identical, the distribution is symmetric, as with a normal distribution (bell curve); when the mean is lower than the median, the distribution is negatively (left) skewed.
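A minimal sketch of this mean-versus-median check, using Python's standard `statistics` module, might look as follows. The function name `skew_direction` and the `tolerance` parameter are assumptions made for the example. The comparison is a heuristic and can disagree with the moment-based skewness on unusual distributions.

```python
from statistics import mean, median

def skew_direction(values, tolerance=1e-9):
    """Classify skew direction by comparing mean and median (heuristic)."""
    gap = mean(values) - median(values)
    if gap > tolerance:
        return "right"      # mean pulled above the median by large values
    if gap < -tolerance:
        return "left"       # mean pulled below the median by small values
    return "symmetric"      # mean and median coincide within the tolerance

print(skew_direction([20, 100, 50, 0, 0, 0, 0, 0, 0]))              # right
print(skew_direction([100, 100, 100, 100, 100, 100, 50, 100, 20]))  # left
```

Like the skewness statistic, this check looks only at the values, so it also ignores the order of the dates.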
