Input type must be string type but got ArrayType(StringType,true) error in Spark using Scala
I am new to Spark and am using Scala to create a basic classifier. I am reading a text file as a dataset and splitting it into training and test data sets. Then I try to tokenize the training data, but it fails with
Caused by: java.lang.IllegalArgumentException: requirement failed: Input type must be string type but got ArrayType(StringType,true).
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.feature.RegexTokenizer.validateInputType(Tokenizer.scala:149)
at org.apache.spark.ml.UnaryTransformer.transformSchema(Transformer.scala:110)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132)
at com.classifier.classifier_app.App$.<init>(App.scala:91)
at com.classifier.classifier_app.App$.<clinit>(App.scala)
... 1 more
error.
The code is as follows:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.Tokenizer

val input_path = "path/to/file.txt"

// Each line of the text file becomes one Sentence.
case class Sentence(value: String)
val sentencesDS = spark.read.textFile(input_path).as[Sentence]

// 70/30 split into training and test sets.
val Array(trainingData, testData) = sentencesDS.randomSplit(Array(0.7, 0.3))

// Splits the "value" column (a string) on whitespace into an array of words.
val tokenizer = new Tokenizer()
  .setInputCol("value")
  .setOutputCol("words")

// regexTokenizer, remover, hashingTF and ovr are defined elsewhere.
val pipeline = new Pipeline().setStages(Array(tokenizer, regexTokenizer, remover, hashingTF, ovr))
val model = pipeline.fit(trainingData)
How do I solve this? Any help is appreciated.
I have defined all the stages in the pipeline but haven't put them here in the code snippet.
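For reference, the requirement that fails can be sketched in plain Scala (this is inferred from the stack trace and the error message, not copied from Spark's source): RegexTokenizer only accepts a StringType input column, so pointing it at the ArrayType column produced by another tokenizer trips exactly this check.

```scala
// Simplified stand-ins for Spark SQL types (assumptions, not the real classes).
sealed trait DataType
case object StringType extends DataType
case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType

// Sketch of RegexTokenizer.validateInputType: reject anything but StringType.
def validateInputType(inputType: DataType): Unit =
  require(inputType == StringType,
    s"Input type must be string type but got $inputType.")
```

Calling `validateInputType(ArrayType(StringType, true))` throws an IllegalArgumentException with the same "Input type must be string type but got ArrayType(StringType,true)" message seen above.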
The error was resolved when the order of execution of the stages in the pipeline was changed:
val pipeline = new Pipeline().setStages(Array(indexer, regexTokenizer, remover, hashingTF))
val model = pipeline.fit(trainingData)
The tokenizer was replaced with regexTokenizer, so a single tokenizer now reads the raw string column directly; RegexTokenizer requires a StringType input, so it cannot be fed the ArrayType output of another tokenizer.
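Why stage order and column wiring matter can be simulated in plain Scala, without a Spark dependency (the stage names, column names, and types below are simplified assumptions): Pipeline.fit walks the stages in order, and each stage validates its input column's type against the schema built up so far before registering its output column.

```scala
// Simplified column types (assumptions, not Spark's classes).
sealed trait DType
case object Str extends DType
case class Arr(elem: DType) extends DType

// A hypothetical stage: validates inputCol's type, then adds outputCol.
case class Stage(name: String, inputCol: String, outputCol: String,
                 requires: DType, produces: DType)

// Sketch of Pipeline.transformSchema: fold the schema through the stages.
def checkStages(schema: Map[String, DType], stages: Seq[Stage]): Map[String, DType] =
  stages.foldLeft(schema) { (s, st) =>
    require(s(st.inputCol) == st.requires,
      s"${st.name}: input must be ${st.requires} but got ${s(st.inputCol)}")
    s + (st.outputCol -> st.produces)
  }

// Both tokenizers require a Str input; each produces an Arr(Str) of words.
val tokenizer      = Stage("tokenizer",      "value", "words",  Str, Arr(Str))
val regexTokenizer = Stage("regexTokenizer", "words", "tokens", Str, Arr(Str))

// Fails: regexTokenizer's input column holds Arr(Str), not Str.
// checkStages(Map("value" -> Str), Seq(tokenizer, regexTokenizer))

// Works: point regexTokenizer at the raw string column instead.
val fixed = regexTokenizer.copy(inputCol = "value")
checkStages(Map("value" -> Str), Seq(fixed))
```

Uncommenting the failing call reproduces the shape of the original error: validation happens on the schema, before any data is touched, which is why pipeline.fit fails immediately.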