Scale data from dataframe obtained with pyspark
I'm trying to scale some data from a csv file. I'm using pyspark to obtain the dataframe and sklearn for the scaling part. Here is the code:
from sklearn import preprocessing
import numpy as np
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.option('header', 'true').csv('flights.csv')
X_scaled = preprocessing.scale(df)
If I build the dataframe with pandas, the scaling part works without any problems, but with spark I get this error:
ValueError: setting an array element with a sequence.
So I'm guessing that the element types differ between pandas and pyspark, but how can I do the scaling with pyspark?
sklearn works with pandas dataframes, so you have to convert the spark dataframe to a pandas dataframe first:
X_scaled = preprocessing.scale(df.toPandas())
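One caveat: `spark.read.csv` reads every column as a string by default, so the frame returned by `toPandas()` may hold object-typed columns. Casting to float before scaling avoids type errors in sklearn. A minimal sketch, with a small hypothetical frame standing in for `df.toPandas()`:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical frame standing in for df.toPandas(): Spark's CSV reader
# returns string columns unless a schema is given or inferSchema is set.
pdf = pd.DataFrame({"dep_delay": ["5", "12", "-3"],
                    "distance": ["300", "150", "870"]})

# Cast to numeric first, then scale to zero mean and unit variance
X_scaled = preprocessing.scale(pdf.astype(float))
print(X_scaled.mean(axis=0))  # each column is centered near zero
```

Alternatively, reading with `.option('inferSchema', 'true')` gives you numeric columns directly, so no cast is needed after `toPandas()`.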
You can use the StandardScaler from pyspark.ml.feature. Attaching a sample script that performs the same kind of pre-processing as sklearn.
Step 1:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features",
                        outputCol="scaled_features",
                        withStd=True, withMean=True)
scaler_model = scaler.fit(transformed_data)
scaled_data = scaler_model.transform(transformed_data)
Remember that before you perform step 1, you need to assemble all the features with VectorAssembler. Hence this will be your step 0:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=required_features, outputCol='features')
transformed_data = assembler.transform(df)