MongoDB to Elasticsearch using PySpark

Original post: 2021-11-10 11:12:24 | Tags: mongodb, elasticsearch, pyspark

I want to integrate my MongoDB collections into Elasticsearch using PySpark. I have the MongoDB connection string, but I don't know how to structure the code or which parameters to specify. Could someone share a code example for this task? I'm still new to this area; I've read some of the documentation, but I got stuck on a few parameters and still don't have a clear picture of the overall flow. Thanks for your help.

1 Answer

In Spark you can work with RDDs or DataFrames. RDDs are unstructured and are the older Spark API; to work with them you use operations such as mapPartitions and foreachPartition. DataFrames are structured and let you use Spark SQL functions, which are similar to ANSI SQL.

MongoDB allows a per-document schema. Spark DataFrames require a uniform schema per DataFrame.

If all the documents in your collection share a uniform schema (this depends on whatever inserted the data), you can define a Spark schema manually in code (pyspark.sql.types.StructType in PySpark). For examples, see the Spark documentation for StructType, StructField, DataType, ArrayType, MapType, StringType, etc.

If your MongoDB collection does not have a uniform schema, you will need to use RDDs and convert the documents to a uniform schema via mapPartitions, so that you can use that uniform schema for the Elasticsearch index.
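As a rough sketch of that normalization step, the function below takes an iterator of loosely structured documents (plain Python dicts) and yields records that all have the same keys and value types. It has the shape that mapPartitions expects: one iterator in, one iterator out. The target fields and the cleanup rules (a legacy "title" key, prices stored as strings, a single tag stored as a bare string) are assumptions made up for illustration.

```python
from typing import Any, Dict, Iterable, Iterator

# Hypothetical target layout (assumption): every output record has exactly these keys.
TARGET_FIELDS = ("sku", "name", "price", "tags")


def normalize_document(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Coerce one loosely structured document into the fixed target layout."""
    raw_price = doc.get("price")
    try:
        price = float(raw_price) if raw_price is not None else None
    except (TypeError, ValueError):
        price = None  # an unparseable price becomes null instead of failing the job

    raw_tags = doc.get("tags") or []
    if not isinstance(raw_tags, (list, tuple)):
        raw_tags = [raw_tags]  # a single tag stored as a bare value

    return {
        "sku": str(doc["sku"]) if doc.get("sku") is not None else None,
        "name": doc.get("name") or doc.get("title"),  # assumed legacy key "title"
        "price": price,
        "tags": [str(tag) for tag in raw_tags],
    }


def normalize_partition(docs: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Partition-level wrapper, suitable for passing to rdd.mapPartitions."""
    for doc in docs:
        yield normalize_document(doc)
```

In a Spark job this would be applied as raw_rdd.mapPartitions(normalize_partition), where raw_rdd is a hypothetical RDD of Python dicts. Because the function is plain Python, it can be unit tested on ordinary lists before it is run on a cluster.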

You can use Spark SQL to read DataFrames from Mongo via PySpark with: spark_session.read.format('mongo').option('uri', 'mongodb://uri.to.mongo.goes.here').schema(spark_schema_goes_here).load() and similarly write using df.write.format('mongo').option('uri', uri).mode(write_mode_eg_append).save()

I can't remember the Elasticsearch syntax, but it is similar; there, the schema is defined per index. Search for Spark on the MongoDB and Elasticsearch sites for more details.
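For the Elasticsearch side, the elasticsearch-hadoop connector exposes a Spark SQL data source named org.elasticsearch.spark.sql, configured through es.* options. Putting the read and the write together, the overall flow might look like the sketch below. The function name and its parameters are my own, not part of either connector, and the 'mongo' source name with its 'uri' option applies to the 2.x/3.x MongoDB Spark Connector; newer connector releases use a different source name and option keys, so check the documentation for the version you install. Both connector jars must be on the Spark classpath, for example through the spark.jars.packages setting.

```python
def copy_collection_to_index(
    spark,
    mongo_uri: str,
    schema,
    es_nodes: str,
    es_index: str,
    es_port: str = "9200",
):
    """Read one MongoDB collection as a DataFrame and append it to an Elasticsearch index.

    `spark` is an existing SparkSession and `schema` a StructType describing
    the documents; both are supplied by the caller.
    """
    df = (
        spark.read.format("mongo")      # MongoDB Spark Connector 2.x/3.x source name
        .option("uri", mongo_uri)       # e.g. mongodb://host:27017/database.collection
        .schema(schema)
        .load()
    )
    (
        df.write.format("org.elasticsearch.spark.sql")  # elasticsearch-hadoop data source
        .option("es.nodes", es_nodes)                   # Elasticsearch host(s)
        .option("es.port", es_port)
        .option("es.resource", es_index)                # target index name
        .mode("append")
        .save()
    )
    return df
```

If the documents carry a natural key, the es.mapping.id option can point at that column so that re-running the job overwrites documents instead of duplicating them.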

It is possible to generate the Elasticsearch index definition automatically from the Spark schema definition. I did this once, but I no longer have the code.
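This is not the code mentioned above, only a minimal sketch of how such a conversion could work. It walks the JSON form of a Spark schema (the dict returned by StructType.jsonValue()) and builds an Elasticsearch "properties" mapping. The type correspondence table is my own assumption and is deliberately small: strings become keyword, timestamps become date, arrays take the mapping of their element type (Elasticsearch has no separate array type), and map types are rejected.

```python
from typing import Any, Dict, Union

# Assumed Spark-to-Elasticsearch type correspondence; adjust to your index needs.
SIMPLE_TYPES = {
    "string": "keyword",  # use "text" instead for fields that need full-text search
    "byte": "byte",
    "short": "short",
    "integer": "integer",
    "long": "long",
    "float": "float",
    "double": "double",
    "boolean": "boolean",
    "date": "date",
    "timestamp": "date",
    "binary": "binary",
}


def field_mapping(spark_type: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Map one Spark type (in its JSON form) to an Elasticsearch field mapping."""
    if isinstance(spark_type, str):
        if spark_type.startswith("decimal"):
            return {"type": "double"}  # lossy simplification for decimal(p,s)
        if spark_type not in SIMPLE_TYPES:
            raise ValueError(f"unsupported Spark type: {spark_type}")
        return {"type": SIMPLE_TYPES[spark_type]}
    kind = spark_type["type"]
    if kind == "struct":
        return es_mapping_from_spark_schema(spark_type)  # nested object
    if kind == "array":
        return field_mapping(spark_type["elementType"])  # arrays use the element mapping
    raise ValueError(f"unsupported Spark type: {kind}")


def es_mapping_from_spark_schema(schema_json: Dict[str, Any]) -> Dict[str, Any]:
    """Build an Elasticsearch 'properties' mapping from StructType.jsonValue() output."""
    return {
        "properties": {
            field["name"]: field_mapping(field["type"])
            for field in schema_json["fields"]
        }
    }
```

Usage would be along the lines of es_mapping_from_spark_schema(df.schema.jsonValue()), with the result sent as the mappings section when the index is created, before the first write.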