
Scale data from dataframe obtained with pyspark

I'm trying to scale some data from a csv file. I'm using pyspark to obtain the dataframe and sklearn for the scaling part. Here is the code:

from sklearn import preprocessing
import numpy as np
import pyspark

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.option('header','true').csv('flights.csv')
X_scaled = preprocessing.scale(df)

If I create the dataframe with pandas, the scaling part works without any problems, but with spark I get this error:

ValueError: setting an array element with a sequence.

So I'm guessing that the element types are different between pandas and pyspark, but how can I do the scaling with pyspark?

sklearn works with pandas dataframes (and NumPy arrays), not with spark dataframes. So you have to convert the spark dataframe to a pandas dataframe first:

X_scaled = preprocessing.scale(df.toPandas())
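Two things are worth keeping in mind with this conversion: toPandas() collects the whole dataset onto the driver, and a CSV read without inferSchema gives string columns only. Below is a minimal sketch of the pandas side, using a small hand-made frame as a stand-in for df.toPandas(). The column names dep_delay and distance and their values are made up for illustration, they are not taken from the question's file.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Stand-in for df.toPandas(): Spark reads CSV columns as strings
# unless inferSchema is enabled, so the values arrive as text.
pdf = pd.DataFrame({
    "dep_delay": ["2", "4", "6", "8"],          # hypothetical column
    "distance": ["100", "200", "300", "400"],   # hypothetical column
})

# Convert the text to numbers explicitly before scaling.
numeric = pdf.apply(pd.to_numeric)

# scale() standardizes each column to zero mean and unit variance
# and returns a NumPy array.
X_scaled = preprocessing.scale(numeric)
```

If the real file also has non-numeric columns (carrier codes, for example), they would need to be dropped or encoded before calling scale().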

You can use the "StandardScaler" class from "pyspark.ml.feature". Here is a sample script that performs the same kind of pre-processing as sklearn (standardizing each feature to zero mean and unit standard deviation):

Step 1:

from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", 
                        outputCol="scaled_features",
                        withStd=True,withMean=True)
scaler_model = scaler.fit(transformed_data)
scaled_data = scaler_model.transform(transformed_data)
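With withMean=True and withStd=True, each feature is centered on its mean and divided by its standard deviation. One detail to be aware of: Spark's documentation describes the standard deviation it uses as the corrected sample standard deviation (n - 1 in the denominator), while sklearn's preprocessing.scale uses the population one (n), so the two results can differ slightly on small datasets. A NumPy sketch of the two formulas, with made-up values:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # made-up feature values
centered = x - x.mean()

# Population standard deviation (ddof=0), as in sklearn's scale().
sklearn_style = centered / x.std(ddof=0)

# Sample standard deviation (ddof=1), as Spark's StandardScaler
# is documented to use.
spark_style = centered / x.std(ddof=1)

# The two results differ by a constant factor of sqrt((n - 1) / n).
ratio = spark_style / sklearn_style
```

For large n the factor is close to 1, so the difference only matters on small data or when comparing outputs digit by digit.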

Remember that before you perform step 1, you need to assemble all the features into a single vector column with VectorAssembler. Hence this will be your step 0:

from pyspark.ml.feature import VectorAssembler
# required_features: list with the names of the numeric columns to scale
assembler = VectorAssembler(inputCols=required_features, outputCol='features')
transformed_data = assembler.transform(df)


 