
MongoDB to Elasticsearch using PySpark

I want to load my MongoDB collections into Elasticsearch using PySpark. I have the connection string for my MongoDB, but I don't know how to structure my code or which parameters to specify. Could someone give me a code example for this task? I'm still new to this area; I've tried reading some documentation, but I got stuck on some parameters and still don't have a clear picture of the flow. Thanks for helping me.

In Spark you can work with RDDs or DataFrames. RDDs are unstructured and belong to the older Spark API; to work with them you use operations such as mapPartitions and foreachPartition. DataFrames are structured and let you use Spark SQL functions, which are similar to ANSI SQL.

MongoDB allows a different schema for each document. Spark DataFrames require a single uniform schema per DataFrame.

If all documents in your collection share a uniform schema (this depends on whatever inserted the data), you can define a Spark schema manually in code ( pyspark.sql.types.StructType in PySpark). For examples, see the Spark documentation for StructType, StructField, DataType, ArrayType, MapType, StringType, etc.

If your MongoDB collection does not have a uniform schema, you will need to use RDDs and convert the documents to a uniform schema via mapPartitions, so that you can use this uniform schema for the Elasticsearch index.
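As a rough illustration of that normalization step, here is a minimal sketch of a function that could be passed to mapPartitions. The field names (name, age, tags) and the coercion rules are hypothetical and only show the idea; adapt them to your own documents.

```python
def normalize_doc(doc):
    """Coerce one Mongo-style dict into a fixed shape: name (str), age (int), tags (list of str).

    The target fields are hypothetical; missing or unparseable values become None or [].
    """
    name = doc.get("name")

    age = doc.get("age")
    try:
        age = int(age) if age is not None else None
    except (TypeError, ValueError):
        age = None  # value present but not convertible to an integer

    tags = doc.get("tags")
    if tags is None:
        tags = []
    elif not isinstance(tags, list):
        tags = [tags]  # a single value becomes a one-element list

    return {
        "name": None if name is None else str(name),
        "age": age,
        "tags": [str(t) for t in tags],
    }


def normalize_partition(docs):
    """Generator suitable for rdd.mapPartitions: takes an iterator of dicts, yields uniform dicts."""
    for doc in docs:
        yield normalize_doc(doc)
```

Given an RDD of document dicts, this would be applied as rdd.mapPartitions(normalize_partition), and the result could then be turned into a DataFrame with an explicit schema that matches the normalized fields.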

You can use Spark SQL to read DataFrames from Mongo via pyspark with: spark_session.read.format('mongo').option('uri', 'mongodb://uri.to.mongo.goes.here').schema(spark_schema_goes_here).load() and similarly write using df.write.format('mongo').option('uri', uri).mode(write_mode_eg_append).save()
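The read and write calls can be wrapped in small helpers, sketched below. The helper names and their parameters are hypothetical. The defaults ('mongo' as the format and 'uri' as the option name) follow the calls quoted in this answer; to my knowledge, newer releases of the MongoDB Spark Connector (10.x) use 'mongodb' and 'connection.uri' instead, so both are left configurable. The sketch assumes the connector package is already available to your Spark session. The schema argument can be a StructType or a DDL string such as "name STRING, age INT".

```python
def read_mongo(spark, uri, schema, source="mongo", uri_option="uri"):
    """Read a MongoDB collection into a DataFrame using an explicit schema.

    spark  : an existing SparkSession
    uri    : MongoDB connection string (including database and collection)
    schema : StructType or DDL string, e.g. "name STRING, age INT"
    """
    return (
        spark.read.format(source)
        .option(uri_option, uri)
        .schema(schema)
        .load()
    )


def write_mongo(df, uri, mode="append", source="mongo", uri_option="uri"):
    """Write a DataFrame back to MongoDB with the given save mode."""
    (
        df.write.format(source)
        .option(uri_option, uri)
        .mode(mode)
        .save()
    )
```

A call would look like read_mongo(spark_session, 'mongodb://uri.to.mongo.goes.here', 'name STRING, age INT'), where the column list is only a placeholder.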

I can't remember the Elasticsearch syntax; however, it is similar, and there the schema is defined per index. Search for Spark on the MongoDB site and on the Elasticsearch site to find more details.
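For reference, a minimal sketch of the Elasticsearch side, assuming the elasticsearch-hadoop (elasticsearch-spark) connector is on the Spark classpath. The data source name org.elasticsearch.spark.sql and the es.nodes, es.port, es.resource and es.mapping.id settings come from that connector; the helper names, the default host and the example index and field names are hypothetical.

```python
def es_write_options(index, nodes="localhost", port=9200, id_field=None):
    """Build the option dict for the elasticsearch-hadoop Spark SQL data source.

    index    : target Elasticsearch index (set as es.resource)
    nodes    : Elasticsearch host(s); "localhost" is only a placeholder
    id_field : optional DataFrame column to use as the document id
    """
    options = {
        "es.resource": index,
        "es.nodes": nodes,
        "es.port": str(port),
    }
    if id_field is not None:
        options["es.mapping.id"] = id_field
    return options


def write_to_es(df, options, mode="append"):
    """Write a DataFrame to Elasticsearch using the options built above."""
    (
        df.write.format("org.elasticsearch.spark.sql")
        .options(**options)
        .mode(mode)
        .save()
    )
```

For example, write_to_es(df, es_write_options("people", nodes="es.host.goes.here", id_field="doc_id")) would append the rows of df to a hypothetical people index, using the doc_id column as the document id.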

It is possible to create the Elasticsearch index definition automatically from the Spark schema definition. I once did this, but I no longer have the code.
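One way this could look is sketched below. It is not the original code, only a reconstruction of the idea under some assumptions. It works on the plain dict returned by StructType.jsonValue(), so it needs no Spark import, and it walks the schema recursively. The type table is my own choice: strings become keyword, timestamps become date, decimals are mapped to double (lossy), arrays take the mapping of their element type because Elasticsearch has no dedicated array type, and maps become a generic object.

```python
# Hypothetical Spark-to-Elasticsearch type table; adjust to your needs.
SIMPLE_TYPES = {
    "string": "keyword",
    "boolean": "boolean",
    "byte": "byte",
    "short": "short",
    "integer": "integer",
    "long": "long",
    "float": "float",
    "double": "double",
    "date": "date",
    "timestamp": "date",
    "binary": "binary",
}


def field_mapping(spark_type):
    """Translate one Spark type (in jsonValue() form) into an Elasticsearch field mapping."""
    if isinstance(spark_type, str):
        if spark_type.startswith("decimal"):
            return {"type": "double"}  # lossy simplification
        return {"type": SIMPLE_TYPES[spark_type]}

    kind = spark_type["type"]
    if kind == "struct":
        return {
            "properties": {
                field["name"]: field_mapping(field["type"])
                for field in spark_type["fields"]
            }
        }
    if kind == "array":
        # Any Elasticsearch field may hold several values of the same type.
        return field_mapping(spark_type["elementType"])
    if kind == "map":
        return {"type": "object"}
    raise ValueError(f"Unsupported Spark type: {kind}")


def es_mapping_from_schema(schema_json):
    """Build an index-creation body from the dict returned by StructType.jsonValue()."""
    return {"mappings": field_mapping(schema_json)}
```

With a DataFrame df, the input would be df.schema.jsonValue(), and the returned dict could be sent as the body of an index-creation request before writing the data.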

