
Logistic Regression in Spark for predictive analysis

I am a beginner in Spark, big data and Scala, and I am trying to build a predictive model in Spark with a sample dataset. I wanted to use PySpark, but MLlib for PySpark currently has limitations: it does not support model save and load. I have a couple of questions:

  1. My data is in csv format and looks like this:

     Buy,Income,Is Female,Is Married,Has College,Is Professional,Is Retired,Unemployed,Residence Length,Dual Income,Minors,Own,House,White,English,Prev Child Mag,Prev Parent Mag
     0,24000,1,0,1,1,0,0,26,0,0,0,1,0,0,0,0
     1,75000,1,1,1,1,0,0,15,1,0,1,1,1,1,1,0

Basically, this data is used to predict whether a user buys the magazine or not, based on all the given parameters.

How can I convert this data into a format easily interpreted by Spark? (I have looked at other related answers here about converting a CSV into an RDD and have tried them, but they made me more confused than before.)

  2. If I just run the regression program given in the MLlib documentation on this data, with part of the data used for training and the rest for testing, how do I turn it into a demo-able format? That is, given a new user, the program walks me through all the parameters and by the end gives me a "yes" or "no" on whether this new person will buy the magazine.

     import org.apache.spark.mllib.regression.LabeledPoint
     import org.apache.spark.mllib.regression.LinearRegressionModel
     import org.apache.spark.mllib.regression.LinearRegressionWithSGD
     import org.apache.spark.mllib.linalg.Vectors

     // Load and parse the data
     val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
     val parsedData = data.map { line =>
       val parts = line.split(',')
       LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
     }.cache()

     // Building the model
     val numIterations = 100
     val model = LinearRegressionWithSGD.train(parsedData, numIterations)

     // Evaluate model on training examples and compute training error
     val valuesAndPreds = parsedData.map { point =>
       val prediction = model.predict(point.features)
       (point.label, prediction)
     }
     val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
     println("training Mean Squared Error = " + MSE)

     // Save and load model
     model.save(sc, "myModelPath")
     val sameModel = LinearRegressionModel.load(sc, "myModelPath")

Basically, where do I go from here if I use this program as my starting point?

With that model, you can predict whether new input data satisfies the model (1) or doesn't (0). To do so:

    val yourInputData = // put your data here
    val res = model.predict(Vectors.dense(yourInputData))
    println(res)

The vector you pass to the `predict` method must have the same number of dimensions as the feature vectors in the data used to construct the model, i.e. the data in "data/mllib/ridge-data/lpsa.data".
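To adapt this to your own CSV, you would parse each row into a `LabeledPoint` (label = the `Buy` column, features = everything else) and train a classifier rather than a regression. Here is a minimal sketch, assuming the file is at a hypothetical path `magazine.csv`, that its first line is the header shown in your question, and using `LogisticRegressionWithLBFGS` from MLlib as the binary classifier:

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.Vectors

    // Load the CSV and skip the header line
    val raw = sc.textFile("magazine.csv")
    val header = raw.first()
    val parsed = raw.filter(_ != header).map { line =>
      val parts = line.split(',').map(_.toDouble)
      // First column ("Buy") is the label; the remaining 16 columns are features
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }.cache()

    // Hold out part of the data for testing
    val Array(training, test) = parsed.randomSplit(Array(0.8, 0.2), seed = 11L)

    // Train a binary logistic regression classifier (buys / doesn't buy)
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(training)

    // Predict for a made-up new user: 16 values in the same column order as
    // the CSV (Income, Is Female, Is Married, ..., Prev Parent Mag)
    val newUser = Vectors.dense(50000, 0, 1, 1, 1, 0, 0, 10, 1, 1, 1, 1, 1, 1, 0, 0)
    val willBuy = model.predict(newUser) // 1.0 or 0.0
    println(if (willBuy == 1.0) "yes" else "no")

For the walk-through demo you describe, you could read each of the 16 parameter values from stdin (e.g. with `scala.io.StdIn.readDouble()`), build the dense vector from those answers, and print "yes"/"no" from the prediction as above.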
