
Reading Nested data from ElasticSearch via Spark Scala

I am trying to read data from Elasticsearch via Spark Scala:

Scala 2.11.8, Spark 2.3.0, Elasticsearch 5.6.8

Connecting via: spark2-shell --jars elasticsearch-spark-20_2.11-5.6.8.jar

val df = spark.read.format("org.elasticsearch.spark.sql")
  .option("es.nodes", "xxxxxxx")
  .option("es.port", "xxxx")
  .option("es.net.http.auth.user", "xxxxx")
  .option("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .option("es.net.http.auth.pass", "xxxxxx")
  .option("es.net.ssl", "true")
  .option("es.nodes.wan.only", "true")
  .option("es.net.ssl.cert.allow.self.signed", "true")
  .option("es.net.ssl.truststore.location", "xxxxx")
  .option("es.net.ssl.truststore.pass", "xxxxx")
  .option("es.read.field.as.array.include", "true")
  .option("pushdown", "true")
  .option("es.read.field.as.array.include", "a4,a4.a41,a4.a42,a4.a43,a4.a43.a431,a4.a43.a432,a4.a44,a4.a45")
  .load("<index_name>")

The schema is as follows:

 |-- a1: string (nullable = true)
 |-- a2: string (nullable = true)
 |-- a3: struct (nullable = true)
 |    |-- a31: integer (nullable = true)
 |    |-- a32: struct (nullable = true)
 |-- a4: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a41: string (nullable = true)
 |    |    |-- a42: string (nullable = true)
 |    |    |-- a43: struct (nullable = true)
 |    |    |    |-- a431: string (nullable = true)
 |    |    |    |-- a432: string (nullable = true)
 |    |    |-- a44: string (nullable = true)
 |    |    |-- a45: string (nullable = true)
 |-- a8: string (nullable = true)
 |-- a9: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a91: string (nullable = true)
 |    |    |-- a92: string (nullable = true)
 |-- a10: string (nullable = true)
 |-- a11: timestamp (nullable = true)

While I am able to read data from the direct columns and from nested schema level 1 (i.e. the a9 or a3 columns) via the command:

df.select(explode($"a9").as("exploded")).select("exploded.*").show

The problem arises when I try to read the a4 elements; it fails with the following error:

    [Stage 18:>                                                         (0 + 1) / 1]20/02/28 02:43:23 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 (TID 54, xxxxxxx, executor 12): scala.MatchError: Buffer() (of class scala.collection.convert.Wrappers$JListWrapper)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:241)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:231)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$2.apply(CatalystTypeConverters.scala:164)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:164)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:154)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379)
        at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:60)
        at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:57)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
        at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:381)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

20/02/28 02:43:23 ERROR scheduler.TaskSetManager: Task 0 in stage 18.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 18.0 failed 4 times, most recent failure: Lost task 0.3 in stage 18.0 (TID 57, xxxxxxx, executor 12): scala.MatchError: Buffer() (of class scala.collection.convert.Wrappers$JListWrapper)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:241)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:231)

What am I doing wrong, or what step am I missing? Please help.

Off the top of my head, this error happens when the schema guessed by the Spark/ElasticSearch connector is not actually compatible with the data being read.

The way I see it, ES is schemaless while SparkSQL has a "hard" schema. Bridging this gap is not always possible, so all of this is just a best effort.

When connecting the two, the connector samples the documents and tries to guess a schema: "field A is a string, field B is an object structure with two subfields: B.1 is a date, B.2 is an array of strings, ... whatever".

If it guessed wrong (typically: a given column/subcolumn was guessed to be a string, but in some documents it is actually an array or a number), then the JSON-to-SparkSQL conversion emits those kinds of errors.
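To make the failure mode concrete, here is a minimal repro sketch (my own illustration, not the connector's code): we declare a column as a string but feed it a list, which is effectively what happens when the connector guesses "string" for a field that is an array in some documents.

// Minimal sketch: the declared schema says "string", the data holds a list.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(StructField("a41", StringType, nullable = true)))
val rows = sc.parallelize(Seq(Row(Seq.empty[String])))  // actual value is an (empty) list, not a string

// Fails at execution time inside CatalystTypeConverters with
// scala.MatchError: List() ..., the same family of error as the Buffer() above.
spark.createDataFrame(rows, schema).show()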

In the documentation's own words:

Elasticsearch treats fields with single or multiple values the same; in fact, the mapping provides no information about this. As a client, this means one cannot tell whether a field is single-valued or not until it is actually being read. In most cases this is not an issue and elasticsearch-hadoop automatically creates the necessary list/array on the fly. However, in environments with a strict schema, such as Spark SQL, changing a field's actual value from its declared type is not allowed. Worse yet, this information needs to be available even before reading the data. Since the mapping is not conclusive enough, elasticsearch-hadoop allows the user to specify the extra information through field information, specifically es.read.field.as.array.include and es.read.field.as.array.exclude.

Therefore, I would suggest that you check whether the schema you reported in your question (the schema Spark guessed) is actually valid for all of your documents; one way to check is sketched below.
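A sketch of that check (assuming the es.* connection settings from the question are also present in the Spark configuration, since esJsonRDD reads them from there rather than from DataFrameReader options): pull the documents as raw JSON, bypassing the connector's guessed schema, and let Spark's JSON reader infer a schema across all documents. Fields that come back as arrays here but as strings in df.printSchema() are the culprits.

import org.elasticsearch.spark._   // adds esJsonRDD to the SparkContext
import spark.implicits._

val rawJson = sc.esJsonRDD("<index_name>").values  // RDD[(id, json)]; keep only the json strings
val inferred = spark.read.json(rawJson.toDS())     // schema inferred over every document

inferred.printSchema()  // compare with the connector-guessed schema above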

If it is not, you have a few options:

  1. Correct the mapping individually. If the problem is related to an array type not being recognized as such, you can do so with configuration options: the es.read.field.as.array.include (resp. .exclude) options are used to actively tell Spark which properties in the documents are arrays (resp. are not arrays). If a field is simply unused, es.read.field.exclude is an option that will exclude the given field from Spark altogether, bypassing its potential schema problems (see the first sketch after this list).

  2. If there is no way to provide ElasticSearch with a schema that is valid for all cases (e.g. some field is sometimes a number, sometimes a string, and there is no way to tell), then basically you are stuck going back to the RDD level (and, if need be, moving back to a Dataset / DataFrame once the schema is well defined); see the second sketch after this list.
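A sketch of the first option, reusing the (redacted) connection options from the question. The field lists here are assumptions you will need to adjust: es.read.field.as.array.include should name only the fields that really are multi-valued in your documents (whether a4's scalar subfields such as a4.a41 belong there depends on whether they are ever multi-valued), and es.read.field.exclude drops a field you never read.

val dfFixed = spark.read.format("org.elasticsearch.spark.sql")
  .option("es.nodes", "xxxxxxx")
  .option("es.port", "xxxx")
  // ... same auth / SSL options as in the question ...
  .option("es.read.field.as.array.include", "a4,a9")  // assumption: only these are really arrays
  .option("es.read.field.exclude", "a10")             // hypothetical: a field you never read
  .load("<index_name>")

dfFixed.select(explode($"a4").as("exploded")).select("exploded.*").show()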
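And a sketch of the second option, going through the RDD level: read the documents as raw JSON, normalize the ambiguous field yourself, and only then come back to a DataFrame. json4s ships with Spark; the transformation below (wrapping a lone object into a one-element array for a4) is an assumption about what the offending documents look like.

import org.elasticsearch.spark._
import org.json4s._
import org.json4s.jackson.JsonMethods.{parse, compact, render}
import spark.implicits._

val normalized = sc.esJsonRDD("<index_name>").values.map { doc =>
  val fixed = parse(doc).transformField {
    // hypothetical fix: documents where a4 is a single object rather than an array
    case ("a4", v: JObject) => ("a4", JArray(List(v)))
  }
  compact(render(fixed))
}

// let Spark re-infer the (now consistent) schema, or pass one explicitly via .schema(...)
val clean = spark.read.json(normalized.toDS())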
