How to convert an array of values so that each value is closer to the mean, but with a similarly shaped distribution (i.e. reduce the stdev) in PySpark
I hope I've described the job I need to do in the correct terms. Essentially, I need to 'compress' a series of values so that all the values are closer to the mean, but their values should be reduced (or increased) relative to their distance from the mean...
The dataframe looks like this:
>>> df[['population', 'postalCode']].show(10)
+----------+----------+
|population|postalCode|
+----------+----------+
| 1464| 96028|
| 465| 96015|
| 366| 96016|
| 5490| 96101|
| 183| 96068|
| 569| 96009|
| 366| 96054|
| 90| 96119|
| 557| 96006|
| 233| 96116|
+----------+----------+
only showing top 10 rows
>>> df.describe().show()
+-------+------------------+------------------+
|summary| population| postalCode|
+-------+------------------+------------------+
| count| 1082| 1082|
| mean|23348.511090573014| 93458.60813308688|
| stddev|21825.045923603615|1883.6307236060127|
+-------+------------------+------------------+
The population mean is about right for my purposes, but I need the variance around it to be smaller...
Hope that makes sense; any help performing this job either in pyspark or node.js is greatly appreciated.
The general idea is to subtract the mean, rescale the deviations by the ratio of the desired standard deviation to the current one, and add the (new) mean back. This is a linear transform, so the shape of the distribution is preserved.
In pseudo-code, if your values are stored in the variable x:
x.scaled = new.mean + (x - mean(x)) * new.SD/sd(x)
Or, for the specific case of, say, SD=1000 and no change to the mean:
x.scaled = mean(x) + (x - mean(x)) * 1000/sd(x)
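As a quick sanity check of the formula, here is a minimal sketch in plain Python (the `rescale` helper and the sample population values from the question are illustrative); a commented PySpark translation follows inside the block, assuming the dataframe has a `population` column as shown above:

```python
from statistics import mean, stdev

def rescale(values, new_sd, new_mean=None):
    """Linearly rescale values so the sample stdev becomes new_sd.

    Each value keeps its position relative to the mean, so the shape
    of the distribution is preserved; only the spread changes.
    """
    mu = mean(values)
    sd = stdev(values)
    if new_mean is None:
        new_mean = mu  # keep the original mean by default
    return [new_mean + (x - mu) * new_sd / sd for x in values]

# First 10 population values from the question's dataframe
populations = [1464, 465, 366, 5490, 183, 569, 366, 90, 557, 233]
scaled = rescale(populations, new_sd=1000)

print(round(mean(scaled), 2))   # mean is unchanged: 978.3
print(round(stdev(scaled), 2))  # stdev is now 1000.0

# Equivalent idea in PySpark (sketch, not run here):
#   from pyspark.sql import functions as F
#   stats = df.agg(F.mean('population').alias('mu'),
#                  F.stddev('population').alias('sd')).first()
#   df = df.withColumn('population_scaled',
#                      stats['mu'] + (F.col('population') - stats['mu'])
#                      * 1000 / stats['sd'])
```

The PySpark sketch computes the aggregates once with `df.agg`, then applies the same linear formula per row with `withColumn`, so no collect of the full dataset is needed.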