PCA analysis with PySpark
I am working on PCA analysis using PySpark, but I'm getting errors because the data read from the CSV file has an incompatible type. What should I do? Could you please help me?
from __future__ import print_function
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import udf
import pandas as pd
import numpy as np
from numpy import array

conf = SparkConf().setAppName("building a warehouse")
sc = SparkContext(conf=conf)

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("PCAExample")\
        .getOrCreate()
    data = sc.textFile('dataset.csv') \
        .map(lambda line: line.split(','))\
        .collect()
    # create a data frame from data read from csv file
    df = spark.createDataFrame(data, ["features"])
    # convert data to vector udt
    df.show()
    pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
    model = pca.fit(df)
    result = model.transform(df).select("pcaFeatures")
    result.show(truncate=False)
    spark.stop()
Here is the error I'm getting:
File "C:/spark/spark-2.1.0-bin-hadoop2.7/bin/pca_bigdata.py", line 38, in <module>
model = pca.fit(df)
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually StringType.'
The error message itself says that the features column must be of type VectorUDT, but it is actually StringType. So this should work for you:
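from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.types import StructField, StructType

# pyspark.ml.feature.PCA expects pyspark.ml.linalg vectors (not pyspark.mllib ones),
# so each parsed CSV row (a list of strings) is converted to a dense vector of floats.
rows = [(Vectors.dense([float(x) for x in line]),) for line in data]
df = spark.createDataFrame(rows, StructType([
    StructField("features", VectorUDT(), True)
]))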
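For reference, the whole flow from raw CSV lines to PCA output can be sketched as follows. This is only a minimal sketch: it assumes dataset.csv has no header row and contains only numeric columns, and `parse_row` and `run_pca` are hypothetical helper names introduced here, not part of PySpark.

```python
def parse_row(line):
    """Split one CSV line and convert every field to float."""
    return [float(value) for value in line.strip().split(",")]


def run_pca(spark, lines, k=3):
    """Fit PCA on an iterable of CSV lines and return the projected DataFrame.

    Hypothetical helper: assumes every line has the same number of numeric
    fields and that k does not exceed that number.
    """
    # Imported here so parse_row stays usable without a Spark installation.
    from pyspark.ml.feature import PCA
    from pyspark.ml.linalg import Vectors

    # One-element tuples give a single column named "features";
    # Spark infers the vector type from the DenseVector values.
    rows = [(Vectors.dense(parse_row(line)),) for line in lines if line.strip()]
    df = spark.createDataFrame(rows, ["features"])

    model = PCA(k=k, inputCol="features", outputCol="pcaFeatures").fit(df)
    return model.transform(df).select("pcaFeatures")
```

With an existing SparkSession named `spark`, it could be called as `run_pca(spark, open("dataset.csv"), k=3).show(truncate=False)`. Building the rows on the driver like this only suits small files; for larger data, reading with `spark.read.csv` and assembling the columns with `VectorAssembler` is probably the better fit.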