
Labelling points for classification in Spark

I'm trying to run multiple classifiers on this telecom dataset to predict churn. So far, I've loaded my dataset into a Spark RDD, but I'm not sure how I can select one column to be the label - in this case, the last column. I'm not asking for code, just a short explanation of how RDDs and LabeledPoint work together. I looked at the examples provided in the official Spark GitHub, but they seem to use the libsvm format.

Question: how does LabeledPoint work, and how can I specify what my label is?

My code so far, if it helps:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD, LogisticRegressionWithLBFGS, LogisticRegressionModel, NaiveBayes, NaiveBayesModel}

object Churn {
   def main(args: Array[String]): Unit = {
    //setting spark context
    val conf = new SparkConf().setAppName("Churn")
    val sc = new SparkContext(conf)
    //loading and mapping data into RDD
    val csv = sc.textFile("file:///filename.csv")
    val data = csv.map(line => line.split(",").map(elem => elem.trim))
    /* computer learns which points are features and labels here */
}
}

The dataset looks like this:

State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
KS,128,415,382-4657,no,yes,25,265.100000,110,45.070000,197.400000,99,16.780000,244.700000,91,11.010000,10.000000,3,2.700000,1,False.
OH,107,415,371-7191,no,yes,26,161.600000,123,27.470000,195.500000,103,16.620000,254.400000,103,11.450000,13.700000,3,3.700000,1,False.
NJ,137,415,358-1921,no,no,0,243.400000,114,41.380000,121.200000,110,10.300000,162.600000,104,7.320000,12.200000,5,3.290000,0,False.

You need to decide what your features are: for example, the phone number will not be a feature, so some columns will be dropped. Then you want to transform the string columns to numbers. Yes, you could do it with ML transformers, but that's overkill in this situation. I'd do it like this (showing the logic on a single line of your data):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val line = "NJ,137,415,358-1921,no,no,0,243.400000,114,41.380000,121.200000,110,10.300000,162.600000,104,7.320000,12.200000,5,3.290000,0,False."
val arrl = line.split(",").map(_.trim)
// map the yes/no flags and the True./False. churn values to numeric strings
val mr = Map("no" -> "0.0", "yes" -> "1.0", "False." -> "0.0", "True." -> "1.0")
// keep the area code, the two plan flags and the numeric columns; drop state, phone and the label
val stringvec = Array( arrl(2), mr(arrl(4)), mr(arrl(5)) ) ++ arrl.slice(6, 20)

val label = mr(arrl(20)).toDouble
val vec = stringvec.map(_.toDouble)
LabeledPoint( label, Vectors.dense(vec))
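
On the sample line this evaluates to LabeledPoint(0.0, [415.0,0.0,0.0,0.0,243.4,114.0,41.38,121.2,110.0,10.3,162.6,104.0,7.32,12.2,5.0,3.29,0.0]): the label (0.0, no churn) plus 17 numeric features.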

So, to answer your question: a LabeledPoint is the target variable (here, whether the customer has churned, taken from the last column as a Double) plus the vector of numeric (Double) features describing the customer (vec in this case).
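
To connect this back to the RDD in your question: map the same logic over every row to get an RDD[LabeledPoint], which is what the mllib trainers consume. A minimal sketch, assuming the file still contains the header row shown above (it has to be filtered out) and picking LogisticRegressionWithLBFGS as an arbitrary example classifier:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// csv is the RDD[String] loaded in the question's code
val mr = Map("no" -> "0.0", "yes" -> "1.0", "False." -> "0.0", "True." -> "1.0")
val header = csv.first()
val points = csv.filter(_ != header).map { row =>
  val arrl = row.split(",").map(_.trim)
  val stringvec = Array( arrl(2), mr(arrl(4)), mr(arrl(5)) ) ++ arrl.slice(6, 20)
  LabeledPoint(mr(arrl(20)).toDouble, Vectors.dense(stringvec.map(_.toDouble)))
}.cache() // the iterative trainers pass over the data repeatedly

val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(points)

The other mllib classifiers you imported (SVMWithSGD, NaiveBayes) train on the same RDD[LabeledPoint]; only the DataFrame-based RandomForestClassifier from the ml package would need the data as a DataFrame instead.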
