
How can I connect Azure Databricks to Cosmos DB using MongoDB API?

I have created an Azure Cosmos DB account using the MongoDB API. I need to connect Cosmos DB (MongoDB API) to an Azure Databricks cluster in order to read and write data from Cosmos.

How do I connect an Azure Databricks cluster to the Cosmos DB account?

Here is the PySpark code I use to connect to a Cosmos DB database using the MongoDB API from Azure Databricks (runtime 5.2 ML Beta, which includes Apache Spark 2.4.0 and Scala 2.11, with the MongoDB connector org.mongodb.spark:mongo-spark-connector_2.11:2.4.0):

from pyspark.sql import SparkSession

my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .getOrCreate()

df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource") \
  .option("uri", CONNECTION_STRING) \
  .load()

With a CONNECTION_STRING that looks like this (no spaces around the database and collection names): "mongodb://USERNAME:PASSWORD@testgp.documents.azure.com:10255/DATABASE_NAME.COLLECTION_NAME?ssl=true&replicaSet=globaldb"
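To make the expected format explicit, the connection string can be assembled in plain Python; every value below is a placeholder for your own account details:

```python
# Assemble a Cosmos DB (MongoDB API) connection string.
# USERNAME, PASSWORD, host, database and collection are placeholders.
username = "USERNAME"
password = "PASSWORD"
host = "testgp.documents.azure.com"
database = "DATABASE_NAME"
collection = "COLLECTION_NAME"

CONNECTION_STRING = (
    f"mongodb://{username}:{password}@{host}:10255/"
    f"{database}.{collection}?ssl=true&replicaSet=globaldb"
)
```

Note that the database and collection names go directly after the port, separated by a dot, with no spaces anywhere in the string.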

I tried many other options (adding the database and collection names as options or as SparkSession config) without success. Let me know if it works for you...

After adding the org.mongodb.spark:mongo-spark-connector_2.11:2.4.0 package, this worked for me:

import json

query = {
  '$limit': 100,
}

query_config = {
  'uri': 'myConnectionString',
  'database': 'myDatabase',
  'collection': 'myCollection',
  'pipeline': json.dumps(query),
}

df = spark.read.format("com.mongodb.spark.sql") \
  .options(**query_config) \
  .load()
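If you need more than one aggregation stage, the `pipeline` option also accepts a JSON array of stages. A minimal sketch (the stage contents here are illustrative placeholders, not from the original post):

```python
import json

# A multi-stage aggregation pipeline is serialized as a JSON array;
# both stages below are illustrative placeholders.
pipeline = [
    {'$match': {'status': 'active'}},
    {'$limit': 100},
]
pipeline_json = json.dumps(pipeline)
```

The resulting `pipeline_json` string is then passed as the `'pipeline'` value in the config dict, exactly like the single-stage example above.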

I do, however, get this error with some collections:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 10.139.64.6, executor 0): com.mongodb.MongoInternalException: The reply message length 10168676 is less than the maximum message length 4194304
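One workaround to try is requesting smaller batches, so each reply from Cosmos DB stays under the driver's maximum message length. The option name below is an assumption and may differ across connector versions; the Spark call is shown commented out because it needs a live cluster:

```python
# Hypothetical workaround: ask the connector for smaller batches so each
# server reply stays under the driver's maximum message length.
# All values are placeholders; 'batchSize' is an assumed option name.
query_config_small_batches = {
    'uri': 'myConnectionString',   # placeholder
    'database': 'myDatabase',      # placeholder
    'collection': 'myCollection',  # placeholder
    'batchSize': '100',            # smaller batches per reply (assumed option name)
}

# On a cluster with the connector installed:
# df = spark.read.format("com.mongodb.spark.sql") \
#   .options(**query_config_small_batches) \
#   .load()
```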

Answering here the same way I answered my own question.

Using Maven as the library source, I installed the right connector on my cluster using the coordinates

org.mongodb.spark:mongo-spark-connector_2.11:2.4.0

which matches Spark 2.4.

An example of the code I used is as follows (for those who want to try):

# Read configuration
readConfig = {
    "URI": "<URI>",
    "Database": "<database>",
    "Collection": "<collection>",
    "ReadingBatchSize": "<batchSize>"
}


pipelineAccounts = "{'$sort' : {'account_contact': 1}}"

# Connect via the MongoDB Spark connector to create a Spark DataFrame
accountsTest = (spark.read.
                 format("com.mongodb.spark.sql").
                 options(**readConfig).
                 option("pipeline", pipelineAccounts).
                 load())

accountsTest.select("account_id").show()
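Since the original question asks about writing data back as well, here is a hedged sketch of the corresponding write configuration. The option names mirror the read config above but are not confirmed by the original post, so verify them against your connector version; the actual write call needs a live cluster and is shown commented out:

```python
# Write configuration, mirroring the read config above.
# All values are placeholders; verify option names for your connector version.
writeConfig = {
    "URI": "<URI>",
    "Database": "<database>",
    "Collection": "<collection>",
}

# On a Databricks cluster with the connector installed:
# (accountsTest.write.
#     format("com.mongodb.spark.sql").
#     options(**writeConfig).
#     mode("append").
#     save())
```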
