
How can I connect Azure Databricks to Cosmos DB using MongoDB API?

I have created an Azure Cosmos DB account using the MongoDB API. I need to connect Cosmos DB (MongoDB API) to an Azure Databricks cluster in order to read and write data from Cosmos.

How do I connect an Azure Databricks cluster to the Cosmos DB account?

Here is the PySpark code I use to connect to a Cosmos DB database using the MongoDB API from Azure Databricks (runtime 5.2 ML Beta, which includes Apache Spark 2.4.0 and Scala 2.11, with the MongoDB connector org.mongodb.spark:mongo-spark-connector_2.11:2.4.0):

from pyspark.sql import SparkSession

my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .getOrCreate()

df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource") \
  .option("uri", CONNECTION_STRING) \
  .load()

With a CONNECTION_STRING that looks like this (no spaces around the database and collection names): "mongodb://USERNAME:PASSWORD@testgp.documents.azure.com:10255/DATABASE_NAME.COLLECTION_NAME?ssl=true&replicaSet=globaldb"
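To make the expected format explicit, the connection string can be assembled in plain Python; every value below is a placeholder for your own account details:

```python
# Assemble a Cosmos DB (MongoDB API) connection string.
# USERNAME, PASSWORD, host, database and collection are placeholders.
username = "USERNAME"
password = "PASSWORD"
host = "testgp.documents.azure.com"
database = "DATABASE_NAME"
collection = "COLLECTION_NAME"

CONNECTION_STRING = (
    f"mongodb://{username}:{password}@{host}:10255/"
    f"{database}.{collection}?ssl=true&replicaSet=globaldb"
)
```

Note that the database and collection names go directly after the port, separated by a dot, with no spaces anywhere in the string.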

I tried many other options (adding the database and collection names as options or as SparkSession config) without success. Let me know if it works for you...

After adding the org.mongodb.spark:mongo-spark-connector_2.11:2.4.0 package, this worked for me:

import json

query = {
  '$limit': 100,
}

query_config = {
  'uri': 'myConnectionString',
  'database': 'myDatabase',
  'collection': 'myCollection',
  'pipeline': json.dumps(query),
}

df = spark.read.format("com.mongodb.spark.sql") \
  .options(**query_config) \
  .load()
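If you need more than one aggregation stage, the `pipeline` option also accepts a JSON array of stages. A minimal sketch (the stage contents here are illustrative placeholders, not from the original post):

```python
import json

# A multi-stage aggregation pipeline is serialized as a JSON array;
# both stages below are illustrative placeholders.
pipeline = [
    {'$match': {'status': 'active'}},
    {'$limit': 100},
]
pipeline_json = json.dumps(pipeline)
```

The resulting `pipeline_json` string is then passed as the `'pipeline'` value in the config dict, exactly like the single-stage example above.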

I do, however, get this error with some collections:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 10.139.64.6, executor 0): com.mongodb.MongoInternalException: The reply message length 10168676 is less than the maximum message length 4194304
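One workaround to try is requesting smaller batches, so each reply from Cosmos DB stays under the driver's maximum message length. The option name below is an assumption and may differ across connector versions; the Spark call is shown commented out because it needs a live cluster:

```python
# Hypothetical workaround: ask the connector for smaller batches so each
# server reply stays under the driver's maximum message length.
# All values are placeholders; 'batchSize' is an assumed option name.
query_config_small_batches = {
    'uri': 'myConnectionString',   # placeholder
    'database': 'myDatabase',      # placeholder
    'collection': 'myCollection',  # placeholder
    'batchSize': '100',            # smaller batches per reply (assumed option name)
}

# On a cluster with the connector installed:
# df = spark.read.format("com.mongodb.spark.sql") \
#   .options(**query_config_small_batches) \
#   .load()
```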

Answering here the same way I answered my own question.

Using Maven as the library source, I installed the right connector on my cluster using the coordinates

org.mongodb.spark:mongo-spark-connector_2.11:2.4.0

which matches Spark 2.4.

An example of the code I used is as follows (for those who want to try):

# Read configuration
readConfig = {
    "URI": "<URI>",
    "Database": "<database>",
    "Collection": "<collection>",
    "ReadingBatchSize": "<batchSize>"
}


pipelineAccounts = "{'$sort' : {'account_contact': 1}}"

# Connect via the MongoDB Spark connector to create a Spark DataFrame
accountsTest = (spark.read.
                 format("com.mongodb.spark.sql").
                 options(**readConfig).
                 option("pipeline", pipelineAccounts).
                 load())

accountsTest.select("account_id").show()
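Since the original question asks about writing data back as well, here is a hedged sketch of the corresponding write configuration. The option names mirror the read config above but are not confirmed by the original post, so verify them against your connector version; the actual write call needs a live cluster and is shown commented out:

```python
# Write configuration, mirroring the read config above.
# All values are placeholders; verify option names for your connector version.
writeConfig = {
    "URI": "<URI>",
    "Database": "<database>",
    "Collection": "<collection>",
}

# On a Databricks cluster with the connector installed:
# (accountsTest.write.
#     format("com.mongodb.spark.sql").
#     options(**writeConfig).
#     mode("append").
#     save())
```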
