
Input type must be string type but got ArrayType(StringType,true) error in Spark using Scala

I am new to Spark and am using Scala to create a basic classifier. I am reading from a text file as a dataset and splitting it into training and test data sets. When I try to tokenize the training data, it fails with

Caused by: java.lang.IllegalArgumentException: requirement failed: Input type must be string type but got ArrayType(StringType,true).
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.feature.RegexTokenizer.validateInputType(Tokenizer.scala:149)
at org.apache.spark.ml.UnaryTransformer.transformSchema(Transformer.scala:110)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132)
at com.classifier.classifier_app.App$.<init>(App.scala:91)
at com.classifier.classifier_app.App$.<clinit>(App.scala)
... 1 more

error.

The code is as below:

val input_path = "path//to//file.txt"

case class Sentence(value: String)
val sentencesDS = spark.read.textFile(input_path).as[Sentence]

val Array(trainingData, testData) = sentencesDS.randomSplit(Array(0.7, 0.3))

val tokenizer = new Tokenizer()
  .setInputCol("value")
  .setOutputCol("words")

// regexTokenizer, remover, hashingTF and ovr are defined elsewhere (omitted here)
val pipeline = new Pipeline().setStages(Array(tokenizer, regexTokenizer, remover, hashingTF, ovr))
val model = pipeline.fit(trainingData)
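For context on the error message: a `Tokenizer` takes a `StringType` column and produces an `ArrayType(StringType, true)` column, so any later stage that validates for a string input will reject the tokenizer's output. The following is a minimal, self-contained sketch (the local `SparkSession` and sample data are mine, not from the original post) that shows the schema change:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, StringType}
import org.apache.spark.ml.feature.Tokenizer

val spark = SparkSession.builder().master("local[*]").appName("schema-demo").getOrCreate()
import spark.implicits._

val df = Seq("hello spark world").toDF("value")

val tokenizer = new Tokenizer()
  .setInputCol("value")
  .setOutputCol("words")

val tokenized = tokenizer.transform(df)

// "words" is now ArrayType(StringType, true); feeding it into a stage that
// requires a StringType input (e.g. another tokenizer) raises
// "Input type must be string type but got ArrayType(StringType,true)."
assert(tokenized.schema("words").dataType == ArrayType(StringType, true))
```

Note that `Pipeline.fit` validates the schema of every stage up front via `transformSchema`, which is why the failure appears before any data is actually tokenized.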

How do I solve this? Any help is appreciated.

I have defined all the stages in the pipeline but haven't put them here in the code snippet.

The error was resolved when the order of execution in the pipeline was changed.

val pipeline = new Pipeline().setStages(Array (indexer, regexTokenizer, remover, hashingTF))
val model = pipeline.fit(trainingData) 

The tokenizer was replaced with regexTokenizer.
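In other words, the fix was to have a single tokenizing stage reading the raw `StringType` column, with each later stage consuming the previous stage's output. A runnable sketch of such a pipeline is below; the column names, sample data, and stage settings are my assumptions for illustration (the original post's `indexer` and classifier stages are omitted):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover, HashingTF}

val spark = SparkSession.builder().master("local[*]").appName("pipeline-demo").getOrCreate()
import spark.implicits._

val trainingData = Seq("spark is great", "scala is fun").toDF("value")

// Only one tokenizer, reading the raw StringType column.
val regexTokenizer = new RegexTokenizer()
  .setInputCol("value")     // StringType: satisfies validateInputType
  .setOutputCol("words")

// Each subsequent stage consumes the previous stage's output column.
val remover = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("filtered")

val hashingTF = new HashingTF()
  .setInputCol("filtered")
  .setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(regexTokenizer, remover, hashingTF))
val model = pipeline.fit(trainingData)
```

The key design point is that a `Pipeline` is a linear chain: each stage's `inputCol` must match the name and type of a column that exists after the preceding stages, so keeping both a `Tokenizer` and a `RegexTokenizer` only works if their input/output columns are wired to distinct, type-compatible columns.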
