
PCA analysis with PySpark

I am working on a PCA analysis using PySpark, but I'm getting errors caused by the type of the data read from the CSV file. What should I do? Could you please help me?

from __future__ import print_function
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors, VectorUDT

from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import udf
import pandas as pd
import numpy as np
from numpy import array


conf = SparkConf().setAppName("building a warehouse")
sc = SparkContext(conf=conf)

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("PCAExample")\
        .getOrCreate()



    data = sc.textFile('dataset.csv') \
        .map(lambda line: line.split(',')) \
        .collect()
    # create a data frame from data read from csv file
    df = spark.createDataFrame(data, ["features"])
    # convert data to vector udt

    df.show()

    pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
    model = pca.fit(df)

    result = model.transform(df).select("pcaFeatures")
    result.show(truncate=False)

    spark.stop()

Here is the error I'm getting:

File "C:/spark/spark-2.1.0-bin-hadoop2.7/bin/pca_bigdata.py", line 38, in <module>
model = pca.fit(df)
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually StringType.'
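
The type mismatch most likely comes from the parsing step: line.split(',') returns Python strings, so Spark infers StringType for the column. A minimal sketch of the conversion in plain Python, assuming every field is numeric and the file has no header row; parse_csv_line is a hypothetical helper name, not part of PySpark:

```python
def parse_csv_line(line):
    """Split one comma-separated line and convert every field to float."""
    # str.split returns strings; float() turns each field into a number
    return [float(field) for field in line.strip().split(",")]


raw = "1.0,2.5,3"
print(raw.split(","))       # the fields as str values
print(parse_csv_line(raw))  # the same fields as float values
```

A list of floats like this can then be wrapped with Vectors.dense from pyspark.ml.linalg to get the vector type that PCA asks for.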

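The error message itself says what is wrong: the features column needs to be of type VectorUDT, not StringType. Note that pyspark.ml.feature.PCA expects the VectorUDT from pyspark.ml.linalg (not the one from pyspark.mllib.linalg), and the string fields have to be converted to a numeric vector before the data frame is created. So this should work for you:

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.types import StructField, StructType

# Each row from split(',') is a list of strings: convert it to a dense vector.
# Assumes every field is numeric and the file has no header row.
rows = [(Vectors.dense([float(x) for x in row]),) for row in data]
df = spark.createDataFrame(rows, StructType([
    StructField("features", VectorUDT(), True)
]))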
