How to convert an array of values so that each value is closer to the mean, but with a similarly shaped distribution (i.e. reduce the stdev) in PySpark

I hope I've described the job I need to do in the correct terms. Essentially, I need to 'compress' a series of values so that they all move closer to the mean, with each value reduced (or increased) in proportion to its distance from the mean.

The dataframe looks like this:

>>> df[['population', 'postalCode']].show(10)
+----------+----------+
|population|postalCode|
+----------+----------+
|      1464|     96028|
|       465|     96015|
|       366|     96016|
|      5490|     96101|
|       183|     96068|
|       569|     96009|
|       366|     96054|
|        90|     96119|
|       557|     96006|
|       233|     96116|
+----------+----------+
only showing top 10 rows

>>> df.describe().show()
+-------+------------------+------------------+
|summary|        population|        postalCode|
+-------+------------------+------------------+
|  count|              1082|              1082|
|   mean|23348.511090573014| 93458.60813308688|
| stddev|21825.045923603615|1883.6307236060127|
+-------+------------------+------------------+

The population mean is about right for my purposes, but I need the variance around it to be smaller.

Hope that makes sense. Any help performing this job in either PySpark or Node.js would be greatly appreciated.

The general idea is to:

  1. translate the mean to zero.
  2. rescale to the new standard deviation.
  3. translate to the desired mean (in this case, the original mean).

In pseudo-code, if your values are stored in the variable x:

x.scaled = new.mean + (x - mean(x)) * new.SD/sd(x)
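
As a rough illustration, the pseudo-code above could be written in plain Python along these lines. This is only a sketch: the function name `rescale` and the variable names are made up for the example, the input is just the ten population values shown in the question, and it assumes the sample standard deviation (n - 1), which is the convention Spark's describe() reports.

```python
from statistics import mean, stdev


def rescale(values, new_sd, new_mean=None):
    """Shift and scale values to a target mean and sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample SD (n - 1); raises if fewer than two values
    if new_mean is None:
        new_mean = mu  # default: keep the original mean
    # (x - mu) centres on zero, * new_sd / sigma rescales the spread,
    # + new_mean shifts to the target mean
    return [new_mean + (x - mu) * new_sd / sigma for x in values]


# The ten population values shown in the question (illustrative input only)
population = [1464, 465, 366, 5490, 183, 569, 366, 90, 557, 233]
scaled = rescale(population, new_sd=1000)
```

Because the transformation is linear, the order of the values and the shape of the distribution are preserved; only the spread changes.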

For the specific case of, say, SD=1000 and no change to the mean:

x.scaled = mean(x) + (x - mean(x)) * 1000/sd(x)

