
Unable to pass a Seq[String] to .parquet in Spark Scala

I am trying to read multiple paths in a single call to Spark's Scala API, using the .parquet method.

I have a method that receives a Seq[String], but .parquet doesn't seem to recognize it when included in the method call: it expects a String rather than a Seq[String].

def readPaths(sparkSession: SparkSession, basePath: String, inputPaths: Seq[String]): Dataset[Row] = {
  sparkSession.read
    .option("basepath", basePath)
    .parquet(inputPaths) // Doesn't accept 'inputPaths'
}

At the commented line it simply complains that the Seq[String] is not an object of type String, while it happily accepts plain comma-separated strings such as "", "", "", "".

The signature of the parquet method is:

def parquet(paths: String*): DataFrame

The method expects a varargs parameter, not an explicit Seq. So, in Scala, you have to pass it as:

def readPaths(sparkSession: SparkSession, basePath: String, inputPaths: Seq[String]): Dataset[Row] = {
  sparkSession.read
    .option("basepath", basePath)
    .parquet(inputPaths: _*)
}

Note the ":_*" at the end of the value; this is Scala's sequence-argument ascription, which tells the compiler to expand the Seq into the individual arguments that the varargs parameter expects.
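With that change in place, the method can be called with an actual Seq[String]. A minimal sketch of the call (it assumes a live SparkSession named spark and the readPaths method above; the paths mirror the verification below):

val inputPaths = Seq("/tmp/test/parquet/state=IT", "/tmp/test/parquet/state=US")
val data = readPaths(spark, "/tmp/test/parquet/", inputPaths) // Seq is expanded inside readPaths via :_*
data.show(10)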

Verified in spark2-shell (using Spark 2.3.0.cloudera3):

scala> case class MyProduct(key: Int, value: String, lastSeen: java.sql.Timestamp)
defined class MyProduct

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val baseDS = spark.createDataset(0 until 1000).map(i => MyProduct(i, s"valueOf:$i", new java.sql.Timestamp(System.currentTimeMillis())))
baseDS: org.apache.spark.sql.Dataset[MyProduct] = [key: int, value: string ... 1 more field]

scala> baseDS.withColumn("state", lit("IT"))
res10: org.apache.spark.sql.DataFrame = [key: int, value: string ... 2 more fields]

scala> res10.write.mode("overwrite").partitionBy("state").parquet("/tmp/test/parquet/")

scala> baseDS.withColumn("state", lit("US"))
res12: org.apache.spark.sql.DataFrame = [key: int, value: string ... 2 more fields]

scala> res12.write.mode("append").partitionBy("state").parquet("/tmp/test/parquet/")

scala> val inputPaths = Seq("/tmp/test/parquet/state=IT", "/tmp/test/parquet/state=US")
inputPaths: Seq[String] = List(/tmp/test/parquet/state=IT, /tmp/test/parquet/state=US)

scala> val data = spark.read.parquet(inputPaths)
<console>:38: error: overloaded method value parquet with alternatives:
  (paths: String*)org.apache.spark.sql.DataFrame <and>
  (path: String)org.apache.spark.sql.DataFrame
 cannot be applied to (Seq[String])
       val data = spark.read.parquet(inputPaths)
                             ^

scala> val data = spark.read.parquet(inputPaths:_*)
data: org.apache.spark.sql.DataFrame = [key: int, value: string ... 1 more field]

scala> data.show(10)
+---+-----------+--------------------+
|key|      value|            lastSeen|
+---+-----------+--------------------+
|500|valueOf:500|2019-02-04 17:05:...|
|501|valueOf:501|2019-02-04 17:05:...|
|502|valueOf:502|2019-02-04 17:05:...|
|503|valueOf:503|2019-02-04 17:05:...|
|504|valueOf:504|2019-02-04 17:05:...|
|505|valueOf:505|2019-02-04 17:05:...|
|506|valueOf:506|2019-02-04 17:05:...|
|507|valueOf:507|2019-02-04 17:05:...|
|508|valueOf:508|2019-02-04 17:05:...|
|509|valueOf:509|2019-02-04 17:05:...|
+---+-----------+--------------------+
only showing top 10 rows


scala>

I believe the parquet() function expects a "varargs" parameter, i.e. one or more arguments of type String.

You can pass it a Seq[String], but you have to give the compiler a hint to unpack the Seq into multiple arguments.

An example demonstrating the varargs usage:

scala> def foo(i: String*) = i.mkString(",")
foo: (i: String*)String

scala> foo("a", "b", "c")
res0: String = a,b,c

scala> foo(Seq("a", "b", "c"))
<console>:13: error: type mismatch;
 found   : Seq[String]
 required: String
       foo(Seq("a", "b", "c"))
              ^

scala> foo(Seq("a", "b", "c"):_*)
res2: String = a,b,c

As you can see, the :_* hint solves the problem.
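The same hint applies to the other varargs overloads on DataFrameReader, such as csv and json. A sketch, assuming a live SparkSession named spark and hypothetical input paths:

val csvPaths = Seq("/data/in/a.csv", "/data/in/b.csv")
val csvDf = spark.read.option("header", "true").csv(csvPaths: _*) // expand Seq into csv(paths: String*)

val jsonPaths = Seq("/data/in/a.json", "/data/in/b.json")
val jsonDf = spark.read.json(jsonPaths: _*) // expand Seq into json(paths: String*)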

