
Deeplearning4j to spark pipeline: Convert a String type to org.apache.spark.mllib.linalg.VectorUDT

I have a sentiment analysis program that predicts whether a given movie review is positive or negative using a recurrent neural network. I built it with the Deeplearning4j deep learning library, and now I need to add that program to an Apache Spark pipeline.

To do that, I have a class MovieReviewClassifier which extends org.apache.spark.ml.classification.ProbabilisticClassifier, and I add an instance of that class to the pipeline. The features needed to build the model are supplied to the program via the setFeaturesCol(String s) method. My features are in String form, since they are the text used for sentiment analysis, but the classifier expects features of type org.apache.spark.mllib.linalg.VectorUDT. Is there a way to convert the strings to VectorUDT?

I have attached my code for pipeline implementation below:

public class RNNPipeline {
    final static String RESPONSE_VARIABLE =  "s";
    final static String INDEXED_RESPONSE_VARIABLE =  "indexedClass";
    final static String FEATURES = "features";
    final static String PREDICTION = "prediction";
    final static String PREDICTION_LABEL = "predictionLabel";

    public static void main(String[] args) {

        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("test-client").setMaster("local[2]");
        sparkConf.set("spark.driver.allowMultipleContexts", "true");
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
        SQLContext sqlContext = new SQLContext(javaSparkContext);

        // ======================== Import data ====================================
        DataFrame dataFrame = sqlContext.read().format("com.databricks.spark.csv")
                .option("inferSchema", "true")
                .option("header", "true")
                .load("/home/RNN3/WordVec/training.csv");

        // Split in to train/test data
        double [] dataSplitWeights = {0.7,0.3};
        DataFrame[] data = dataFrame.randomSplit(dataSplitWeights);



        // ======================== Preprocess ===========================



        // Encode labels
        StringIndexerModel labelIndexer = new StringIndexer().setInputCol(RESPONSE_VARIABLE)
                .setOutputCol(INDEXED_RESPONSE_VARIABLE)
                .fit(data[0]);


        // Convert indexed labels back to original labels (decode labels).
        IndexToString labelConverter = new IndexToString().setInputCol(PREDICTION)
                .setOutputCol(PREDICTION_LABEL)
                .setLabels(labelIndexer.labels());


        // ======================== Train ========================



        MovieReviewClassifier mrClassifier = new MovieReviewClassifier().setLabelCol(INDEXED_RESPONSE_VARIABLE).setFeaturesCol("Review");



        // Fit the pipeline for training.
        Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] { labelIndexer, mrClassifier, labelConverter });
        PipelineModel pipelineModel = pipeline.fit(data[0]);
    }
}

Review is the feature column; it contains the strings to be classified as positive or negative.

I get the following error when I execute the code:

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column Review must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually StringType.
    at scala.Predef$.require(Predef.scala:233)
    at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
    at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:50)
    at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
    at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
    at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:167)
    at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:167)
    at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
    at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
    at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
    at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:167)
    at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:62)
    at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:121)
    at RNNPipeline.main(RNNPipeline.java:82)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

According to the VectorUDT documentation,

User-defined type for Vector which allows easy interaction with SQL via DataFrame.

the fact that, in the ML library,

DataFrame supports many basic and structured types; see the Spark SQL datatype reference for a list of supported types. In addition to the types listed in the Spark SQL guide, DataFrame can use ML Vector types.

and the fact that you are asked for an org.apache.spark.sql.types.UserDefinedType<Vector>,

you can probably get away with passing either a DenseVector or a SparseVector, created from your String.

How you convert the String (the "Review" column?) to a Vector depends on how your data is organized.
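For example, if each row of your string column already held a comma-separated list of numbers, the parsing step might look like the sketch below (the helper name and input format are hypothetical); the resulting array can then be wrapped with org.apache.spark.mllib.linalg.Vectors.dense(...) to produce the vector the classifier expects.

```java
public class StringToVector {

    // Parse a comma-separated string such as "0.1, 0.5, 0.9" into a double[].
    // Wrap the result with org.apache.spark.mllib.linalg.Vectors.dense(values)
    // to obtain a DenseVector for the features column.
    public static double[] parseValues(String s) {
        String[] parts = s.split(",");
        double[] values = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Double.parseDouble(parts[i].trim());
        }
        return values;
    }
}
```

This only applies if your strings encode numeric data; free-form review text needs a featurization step instead, as the accepted solution below describes.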

The way to convert a String column to a vector UDT is to use Word2Vec: I had to add a Word2Vec stage to the Spark pipeline to do the conversion.
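A sketch of the adjusted pipeline, assuming the org.apache.spark.ml.feature.Tokenizer and Word2Vec stages available in Spark ML (the vector size and intermediate column names are illustrative, not taken from the original program):

```java
// Tokenize the raw review text into words.
Tokenizer tokenizer = new Tokenizer()
        .setInputCol("Review")
        .setOutputCol("words");

// Word2Vec turns each word sequence into a fixed-size feature vector,
// which satisfies the VectorUDT requirement of the classifier.
Word2Vec word2Vec = new Word2Vec()
        .setInputCol("words")
        .setOutputCol("features")
        .setVectorSize(100)   // illustrative embedding size
        .setMinCount(0);

// Point the classifier at the generated vector column instead of the raw text.
MovieReviewClassifier mrClassifier = new MovieReviewClassifier()
        .setLabelCol(INDEXED_RESPONSE_VARIABLE)
        .setFeaturesCol("features");

// Insert the featurization stages before the classifier.
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {
        labelIndexer, tokenizer, word2Vec, mrClassifier, labelConverter });
```

With this arrangement, pipeline.fit(data[0]) no longer fails the schema check, because the classifier now receives a vector column rather than a StringType column.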
