Scale data from dataframe obtained with pyspark
I'm trying to scale some data from a csv file. I'm using pyspark to obtain the dataframe and sklearn for the scaling part. Here is the code:
from sklearn import preprocessing
import numpy as np
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.option('header', 'true').csv('flights.csv')
X_scaled = preprocessing.scale(df)
If I build the dataframe with pandas, the scaling part works without any problems, but with spark I get this error:
ValueError: setting an array element with a sequence.
So I'm guessing that the element types differ between pandas and pyspark, but how can I do the scaling with pyspark?
sklearn works with pandas dataframes, so you have to convert the spark dataframe to a pandas dataframe first:
X_scaled = preprocessing.scale(df.toPandas())
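One caveat: `spark.read.csv` reads every column as a string by default, so the frame returned by `toPandas()` may hold object-typed columns. Casting to float before scaling avoids type errors in sklearn. A minimal sketch, with a small hypothetical frame standing in for `df.toPandas()`:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical frame standing in for df.toPandas(): Spark's CSV reader
# returns string columns unless a schema is given or inferSchema is set.
pdf = pd.DataFrame({"dep_delay": ["5", "12", "-3"],
                    "distance": ["300", "150", "870"]})

# Cast to numeric first, then scale to zero mean and unit variance
X_scaled = preprocessing.scale(pdf.astype(float))
print(X_scaled.mean(axis=0))  # each column is centered near zero
```

Alternatively, reading with `.option('inferSchema', 'true')` gives you numeric columns directly, so no cast is needed after `toPandas()`.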
You can use the StandardScaler from pyspark.ml.feature. Attaching a sample script that performs the same kind of pre-processing as sklearn.
Step 1:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features",
                        outputCol="scaled_features",
                        withStd=True, withMean=True)
scaler_model = scaler.fit(transformed_data)
scaled_data = scaler_model.transform(transformed_data)
Remember that before you perform step 1, you need to assemble all the features with VectorAssembler. Hence this will be your step 0:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=required_features, outputCol='features')
transformed_data = assembler.transform(df)