
Logistic Regression in Spark for predictive analysis

I am a beginner in Spark, big data, and Scala, and I am trying to build a predictive model in Spark with a sample data set. I wanted to use PySpark, but MLlib for PySpark is currently limited in that it does not support model save and load. I have a couple of questions:

  1. My data is in CSV format and looks like this:

     Buy,Income,Is Female,Is Married,Has College,Is Professional,Is Retired,Unemployed,Residence Length,Dual Income,Minors,Own,House,White,English,Prev Child Mag,Prev Parent Mag
     0,24000,1,0,1,1,0,0,26,0,0,0,1,0,0,0,0
     1,75000,1,1,1,1,0,0,15,1,0,1,1,1,1,1,0

Basically, this data helps predict whether a user buys this magazine or not based on all the given parameters.

How can I convert this data into a format that Spark can easily work with? (I have looked at other related answers here about converting CSV into an RDD and have tried them, but they left me more confused than before; see the parsing sketch below.)

  2. If I just run the logistic regression program given in the MLlib documentation on this data, with part of the data used for training and the rest for testing, how do I turn it into a demo-able format where I have a new user, the program walks me through all the parameters, and at the end it gives me a "yes" or "no" on whether this new person will buy the magazine?

     import org.apache.spark.mllib.regression.LabeledPoint
     import org.apache.spark.mllib.regression.LinearRegressionModel
     import org.apache.spark.mllib.regression.LinearRegressionWithSGD
     import org.apache.spark.mllib.linalg.Vectors

     // Load and parse the data
     val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
     val parsedData = data.map { line =>
       val parts = line.split(',')
       LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
     }.cache()

     // Building the model
     val numIterations = 100
     val model = LinearRegressionWithSGD.train(parsedData, numIterations)

     // Evaluate model on training examples and compute training error
     val valuesAndPreds = parsedData.map { point =>
       val prediction = model.predict(point.features)
       (point.label, prediction)
     }
     val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
     println("training Mean Squared Error = " + MSE)

     // Save and load model
     model.save(sc, "myModelPath")
     val sameModel = LinearRegressionModel.load(sc, "myModelPath")

Basically, where do I go from here if I use this program as my starting point?
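
As a sketch toward the first question (turning the CSV into something MLlib can train on), and assuming the file sits at a placeholder path "magazine.csv" and that the first column, Buy, is the label, the parsing could look roughly like this:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // "magazine.csv" is a placeholder path for the data set shown above
    val raw = sc.textFile("magazine.csv")
    val header = raw.first()                        // the first line holds the column names
    val points = raw.filter(_ != header).map { line =>
      val cols = line.split(',').map(_.trim.toDouble)
      // Buy (column 0) is the label; the remaining 16 columns are the features
      LabeledPoint(cols(0), Vectors.dense(cols.drop(1)))
    }.cache()

The resulting `points` is an RDD of LabeledPoint, which is the input type that the MLlib training methods in the snippet above expect.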

Having that model (the one trained in the code above), you can predict whether a given input gets classified as 1 or 0. To do so:

    val yourInputData = // put your data here
    val res = model.predict(Vectors.dense(yourInputData))
    println(res)

The vector you pass to the predict method should have the same number of dimensions as the data that was used to construct the model, i.e. the data in "data/mllib/ridge-data/lpsa.data".
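
The pasted code trains a linear regression model, but since the goal here is a yes/no answer, a rough end-to-end sketch using MLlib's LogisticRegressionWithLBFGS might look like the following. It assumes the `points` RDD from the parsing sketch above, and the new user's feature values are made up for illustration:

    import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
    import org.apache.spark.mllib.linalg.Vectors

    // split the parsed data into training and test sets
    val Array(training, test) = points.randomSplit(Array(0.7, 0.3), seed = 11L)

    // train a binary classifier: 0 = does not buy the magazine, 1 = buys it
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(training)

    // hypothetical new user: 16 feature values in the same column order as the CSV
    // (Income, Is Female, Is Married, ..., Prev Parent Mag)
    val newUser = Vectors.dense(40000.0, 0, 1, 1, 0, 0, 0, 10, 1, 1, 1, 1, 1, 1, 0, 0)
    val answer = if (model.predict(newUser) == 1.0) "yes" else "no"
    println("Will this person buy the magazine? " + answer)

    // logistic regression models also support save and load in the Scala API
    model.save(sc, "myLogisticModelPath")
    val sameModel = LogisticRegressionModel.load(sc, "myLogisticModelPath")

To make the demo interactive, the 16 values could be read from the console (for example with scala.io.StdIn.readLine) and assembled into the vector before calling predict.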
