
Spark: converting CSV to libsvm format

I have a CSV file with state, age, gender, salary, etc. as independent variables.

The dependent variable is churn.

In Spark, we need to convert the DataFrame to libsvm format. Can anyone tell me how to do it?

The libsvm format is: 0 128:51

Here the leading 0 is the label, and the feature entry 128:51 means that column (feature index) 128 has the value 51.
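For what it's worth, newer Spark (2.x+) can write libsvm directly once you have a label column and a features vector. Below is a minimal Scala sketch, assuming numeric age/salary columns and hypothetical file paths; categorical columns like state and gender would first need a StringIndexer or one-hot encoding:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("CsvToLibsvm").getOrCreate()

// Hypothetical input path and column names for the churn CSV.
val df = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("/path/to/churn.csv")

// Pack the numeric feature columns into a single vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "salary"))   // assumed numeric columns
  .setOutputCol("features")

val out = assembler.transform(df)
  .withColumn("label", col("churn").cast("double"))
  .select("label", "features")

// The built-in libsvm data source expects exactly these two columns.
out.write.format("libsvm").save("/path/to/libsvm-output")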

/*
/Users/mac/matrix.txt
1 0.5 2.4 3.0
1 99 34 6454
2 0.8 3.0 4.5
*/
// Build one libsvm line: the first token is the label,
// the remaining tokens become 1-based index:value pairs.
def concat(a: Array[String]): String = {
  var result = a(0) + " "
  for (i <- 1 until a.size)
    result = result + i + ":" + a(i) + " "   // a(i), not a(i)(0), which would take only the first character
  result.trim
}

val rfile = sc.textFile("file:///Users/mac/matrix.txt")
val f = rfile.map(line => line.split(' ')).map(i => concat(i))
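For illustration, with the fix above (using a(i) rather than its first character), the sample matrix.txt yields one libsvm line per input row; the output directory below is a hypothetical example:

// Expected output, one libsvm line per input row:
// 1 1:0.5 2:2.4 3:3.0
// 1 1:99 2:34 3:6454
// 2 1:0.8 2:3.0 3:4.5
f.saveAsTextFile("file:///Users/mac/matrix-libsvm")   // hypothetical output directory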

I believe I have a much simpler solution.

I was using Hadoop for the same task, but the logic should be the same. I have created a sample example for your use case. First I create a DataFrame, then remove all rows that have either null or blank values. After that I create an RDD and convert each Row into libsvm format. repartition(1) means everything will go into one output file. There will be one label column; e.g., in the case of CTR prediction it will be 1 or 0 only.

Sample file input :

"zip","city","state","latitude","longitude","timezone","dst"
"00210","Portsmouth","NH","43.005895","-71.013202","-5","1"
"00211","Portsmouth","NH","43.005895","-71.013202","-5","1"
"00212","Portsmouth","NH","43.005895","-71.013202","-5","1"
"00213","Portsmouth","NH","43.005895","-71.013202","-5","1"
"00214","Portsmouth","NH","43.005895","-71.013202","-5","1"
"00215","Portsmouth","NH","43.005895","-71.013202","-5","1"
"00501","Holtsville","NY","40.922326","-72.637078","-5","1"
"00544","Holtsville","NY","40.922326","-72.637078","-5","1"

import java.nio.charset.StandardCharsets;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;

import com.google.common.hash.Hashing;

public class LibSvmConvertJob {

    private static final String SPACE = " ";
    private static final String COLON = ":";

    public static void main(String[] args) {

        SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("Libsvm Convertor");

        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);

        SQLContext sqlContext = new SQLContext(javaSparkContext);

        // Spark 1.x reads CSV via the external spark-csv package;
        // in Spark 2+ this is the built-in .format("csv").
        DataFrame inputDF = sqlContext.read().format("com.databricks.spark.csv").option("header", "true")
                .load("/home/raghunandangupta/inputfiles/zipcode.csv");

        inputDF.printSchema();

        // Turn blank strings into null so that na().drop() below removes blank rows too.
        sqlContext.udf().register("convertToNull", (String v1) -> (v1.trim().length() > 0 ? v1.trim() : null), DataTypes.StringType);

        inputDF = inputDF.selectExpr("convertToNull(zip)", "convertToNull(city)", "convertToNull(state)", "convertToNull(latitude)",
                "convertToNull(longitude)", "convertToNull(timezone)", "convertToNull(dst)").na().drop();

        inputDF.javaRDD().map(new Function<Row, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public String call(Row v1) throws Exception {
                StringBuilder sb = new StringBuilder();
                sb.append(hashCode(v1.getString(0))).append("\t")   // label (resultant) column
                .append("1"+COLON+hashCode(v1.getString(1))).append(SPACE)
                .append("2"+COLON+hashCode(v1.getString(2))).append(SPACE)
                .append("3"+COLON+hashCode(v1.getString(3))).append(SPACE)
                .append("4"+COLON+hashCode(v1.getString(4))).append(SPACE)
                .append("5"+COLON+hashCode(v1.getString(5))).append(SPACE)
                .append("6"+COLON+hashCode(v1.getString(6)));
                return sb.toString();
            }
            private String hashCode(String value) {
                // Hashing trick: map each string to a stable murmur3 integer so every
                // feature is numeric; Math.abs keeps the values non-negative.
                return Math.abs(Hashing.murmur3_32().hashString(value, StandardCharsets.UTF_8).hashCode()) + "";
            }
        }).repartition(1).saveAsTextFile("/home/raghunandangupta/inputfiles/zipcode");

    }
}
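If I read the job correctly, each CSV row comes out as one line of the shape below, with the label separated by a tab and the features by spaces; the concrete numbers are murmur3 hashes of the string values, so they are deterministic for a given input:

hash(zip)	1:hash(city) 2:hash(state) 3:hash(latitude) 4:hash(longitude) 5:hash(timezone) 6:hash(dst)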
