
Measure negative/positive skewness of a dataframe

I am looking for a way to check whether data is skewed left or right using Spark. The following example returns the same skewness statistic for both datasets, even though their non-zero values sit at opposite ends of the date range.

>>> from pyspark.sql import functions as f
>>> val1 = [('2018-01-01',20),('2018-02-01',100),('2018-03-01',50),('2018-04-01',0),('2018-05-01',0),('2018-06-01',0),('2018-07-01',0),('2018-08-01',0),('2018-09-01',0)]
>>> val2 = [('2018-01-01',0),('2018-02-01',0),('2018-03-01',0),('2018-04-01',0),('2018-05-01',0),('2018-06-01',0),('2018-07-01',20),('2018-08-01',100),('2018-09-01',50)]
>>> columns = ['date','value']
>>> val1_df = spark.createDataFrame(val1, columns)
>>> val1_df.agg(f.skewness("value")).show()
+-----------------+
|  skewness(value)|
+-----------------+
|1.646145420937772|
+-----------------+

>>> val2_df = spark.createDataFrame(val2, columns)
>>> val2_df.agg(f.skewness("value")).show()
+------------------+
|   skewness(value)|
+------------------+
|1.6461454209377715|
+------------------+

Is there any method to get positive or negative skewness based on the "date" column in Spark?

Both vectors have the same distribution of values, so their skewness is the same. Skewness depends only on the values, not on the order in which they appear:

from scipy.stats import skew

val1 = [20,100,50,0,0,0,0,0,0]
skew(val1)

Out[6]: 1.646145420937772

val2 = [0,0,0,0,0,0,50,100,20]
skew(val2)

Out[7]: 1.646145420937772

If you replace the zeroes in the second vector with 100, the distribution skews to the left:

val2 = [100,100,100,100,100,100,50,100,20]
skew(val2)

Out[9]: -1.5578824286327273

In PySpark:

from pyspark.sql import functions as f

val1 = [(20,100),(100,100),(50,100),(0,100),(0,100),(0,0),(0,50),(0,100),(0,20)]
cols = ['val1','val2']
df = spark.createDataFrame(val1, cols)
display(df.select(f.skewness(df['val1']),f.skewness(df['val2'])))

skewness(val1)  | skewness(val2)
1.6461454209377713 |-0.9860224906700872

Skewness is a statistical moment (the third standardized moment). It is a quantitative way to identify whether a distribution is skewed positively or negatively, and by how much. It is a univariate measure. Multivariate skewness and kurtosis exist, but they are more complicated.
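To make the definition concrete, here is a minimal sketch of the third standardized moment in plain Python. It assumes the population (biased) form, which appears to match the numbers shown above; `population_skewness` is a hypothetical helper name, not a Spark or SciPy function.

```python
def population_skewness(values):
    """Third standardized moment m3 / m2**1.5, using population (biased) moments."""
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    mean = sum(values) / n
    # Second and third central moments.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


early = [20, 100, 50, 0, 0, 0, 0, 0, 0]
late = list(reversed(early))  # same values, opposite order

# Both calls print approximately 1.646: the order of the values plays no role.
print(population_skewness(early))
print(population_skewness(late))
```

Because the formula only sums over the values, reordering them along the date axis cannot change the sign or the size of the result.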

What you are asking for is a qualitative analysis of the distribution. For a multivariate analysis, you could use the chi-square test or Royston's H test. Alternatively, you can bucket the values by date and inspect the result visually.
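The bucketing idea could look roughly like the following. This is a plain-Python sketch rather than Spark code; it assumes ISO `YYYY-MM-DD` date strings as in the question, buckets by month, and uses hypothetical helper names (`total_by_month`, `text_bars`).

```python
from collections import defaultdict


def total_by_month(rows):
    """Sum the values per 'YYYY-MM' bucket; rows are (iso_date, value) pairs."""
    totals = defaultdict(int)
    for iso_date, value in rows:
        totals[iso_date[:7]] += value  # 'YYYY-MM-DD' -> 'YYYY-MM'
    return dict(sorted(totals.items()))


def text_bars(totals, unit=10):
    """Render one line per bucket, with one '#' per `unit` of value."""
    return [f"{bucket} {'#' * (total // unit)}" for bucket, total in totals.items()]


val1 = [('2018-01-01', 20), ('2018-02-01', 100), ('2018-03-01', 50),
        ('2018-04-01', 0), ('2018-05-01', 0), ('2018-06-01', 0),
        ('2018-07-01', 0), ('2018-08-01', 0), ('2018-09-01', 0)]

for line in text_bars(total_by_month(val1)):
    print(line)
```

For `val1` the bars pile up in the first three months, which is the visual cue that plain skewness cannot provide.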

If you want an analytical result, you could bucket the values by date, sort the buckets by value in descending order, and find which 3 or 4 dates hold the top 3 or 4 bucketed values. You can then find out which quarter those dates fall in by defining a QTR lookup table and comparing against it. This gives you an idea of whether the mass sits toward the beginning or the end of the year. If the top dates are scattered across the year, the distribution is most likely independent of the dates.
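The steps above can be sketched in plain Python as follows. The lookup table `QTR_LOOKUP` and the function `top_dates_with_quarter` are hypothetical names chosen for illustration, and the code assumes one row per date with ISO `YYYY-MM-DD` strings. In Spark the same logic could presumably be expressed with an `orderBy`, a `limit`, and `pyspark.sql.functions.quarter`.

```python
from datetime import date

# Month number -> calendar quarter (the "QTR lookup table").
QTR_LOOKUP = {month: (month - 1) // 3 + 1 for month in range(1, 13)}


def top_dates_with_quarter(rows, n=3):
    """Return the n highest-valued (date, value, quarter) triples."""
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)[:n]
    return [(d, v, QTR_LOOKUP[date.fromisoformat(d).month]) for d, v in ranked]


val1 = [('2018-01-01', 20), ('2018-02-01', 100), ('2018-03-01', 50),
        ('2018-04-01', 0), ('2018-05-01', 0), ('2018-06-01', 0),
        ('2018-07-01', 0), ('2018-08-01', 0), ('2018-09-01', 0)]
val2 = [('2018-01-01', 0), ('2018-02-01', 0), ('2018-03-01', 0),
        ('2018-04-01', 0), ('2018-05-01', 0), ('2018-06-01', 0),
        ('2018-07-01', 20), ('2018-08-01', 100), ('2018-09-01', 50)]

print(top_dates_with_quarter(val1))  # top dates all fall in quarter 1
print(top_dates_with_quarter(val2))  # top dates all fall in quarter 3
```

If the quarters in the result agree, the large values are concentrated in that part of the year; if they differ widely, the values are probably not tied to the date.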

Calculate the mean and the median.

When the mean is greater than the median, the distribution is typically positively (right) skewed. When the mean, median and mode are identical, the distribution is symmetric, as in a normal distribution (bell curve). When the mean is less than the median, the distribution is typically negatively (left) skewed.
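As a rough illustration of this rule of thumb, the sketch below compares the mean and the median using Python's standard `statistics` module rather than Spark. The function name `skew_direction` and the `tolerance` parameter are assumptions made for the example, and the rule itself is a heuristic that can fail for unusual distributions.

```python
from statistics import mean, median


def skew_direction(values, tolerance=1e-9):
    """Classify skew direction from the sign of (mean - median)."""
    gap = mean(values) - median(values)
    if abs(gap) <= tolerance:
        return "roughly symmetric"
    return "right (positive)" if gap > 0 else "left (negative)"


print(skew_direction([20, 100, 50, 0, 0, 0, 0, 0, 0]))             # mean above median
print(skew_direction([100, 100, 100, 100, 100, 100, 50, 100, 20]))  # mean below median
print(skew_direction([1, 2, 3, 4, 5]))                              # mean equals median
```

Like the skewness statistic, this check looks only at the values, so it says nothing about where along the date axis the large values occur.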


 