
How to connect PySpark to DataStax Cassandra running on Docker?

I am running DataStax Cassandra on Docker and have already created my table there. Now I want to run a PySpark container with the docker-compose.yml below, but I don't know how to set up the network in the docker-compose.yml file so that the DataStax Cassandra and PySpark containers can reach each other.

This is the docker-compose.yml for running PySpark:

  spark:
    image: jupyter/pyspark-notebook
    container_name: pyspark
    ports:
      - "8888:8888"
      - "4040:4040"
      - "4041:4041"
      - "4042:4042"
    expose:
      - "8888"
      - "4040"
      - "4041"
      - "4042"
    environment:
      CHOWN_HOME: "yes"
      GRANT_SUDO: "yes"
      NB_UID: 1000
      NB_GID: 100
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    volumes:
      - ./Documents:/home/jovyan/work

And this is the docker command that creates the DataStax Cassandra container:

docker run \
    -e DS_LICENSE=accept \
    --memory 4g \
    --name my-dse \
    -d \
    -v /Documents/datastax/cassandra:/var/lib/cassandra \
    -v /Documents/datastax/spark:/var/lib/spark \
    -v /Documents/datastax/dsefs:/var/lib/dsefs \
    -v /Documents/datastax/log/cassandra:/var/log/cassandra \
    -v /Documents/datastax/log/spark:/var/log/spark \
    -v /Documents/datastax/config:/config \
    -v /Documents/datastax/opscenter:/var/lib/opscenter \
    -v /Documents/datastax/datastax-studio:/var/lib/datastax-studio \
    datastax/dse-server:6.8.4 \
    -g \
    -s \
    -k

Please help me write a docker-compose.yml that runs PySpark connected to DataStax Cassandra so I can read data from it.

By default, docker-compose sets up a common network when it starts both containers, so you can just use the DSE container name as the spark.cassandra.connection.host parameter.
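For example, a minimal combined docker-compose.yml could look like the sketch below. The service names and the decision to keep only the -s flag are my assumptions, and the DSE data/log volume mounts from the question are omitted for brevity:

version: "3"
services:
  dse:
    image: datastax/dse-server:6.8.4
    container_name: my-dse
    environment:
      DS_LICENSE: accept
    # -s enables DSE Search only; PySpark lives in the notebook container,
    # so -k (DSE Analytics) is not needed here
    command: -s
  spark:
    image: jupyter/pyspark-notebook
    container_name: pyspark
    depends_on:
      - dse
    ports:
      - "8888:8888"
      - "4040:4040"
    environment:
      CHOWN_HOME: "yes"
      GRANT_SUDO: "yes"
      NB_UID: 1000
      NB_GID: 100
    volumes:
      - ./Documents:/home/jovyan/work

Compose puts both services on the same default network, so inside the notebook container the name my-dse (or the service name dse) resolves to the Cassandra node.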

If both containers aren't managed by docker-compose, then you can do one of the following (in each case, set the spark.cassandra.connection.host parameter accordingly; see the PySpark sketch after this list):

  • just use the internal IP of the DSE container: docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' my-dse
  • use legacy Docker links (really not recommended) and connect via the DSE container name
  • use docker network connect (see the documentation), again with the DSE container name
  • start the DSE Docker image with port 9042 published to the host, and use the host's IP for the connection
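For illustration, reading a table from the Jupyter notebook could look like the sketch below, which uses the open-source Spark Cassandra Connector. The connector version, the host name my-dse, and the keyspace/table names are assumptions to replace with your own:

# Read a Cassandra table from the Jupyter/PySpark container.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-read")
    # Fetch the Spark Cassandra Connector at startup; pick the artifact
    # matching your Spark/Scala build (assumed here: Spark 3.0 / Scala 2.12).
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:3.0.1")
    # The container name resolves when both containers share a Docker
    # network; otherwise put the container or host IP here.
    .config("spark.cassandra.connection.host", "my-dse")
    .config("spark.cassandra.connection.port", "9042")
    .getOrCreate()
)

# Hypothetical keyspace/table names -- replace with the ones you created.
df = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="my_table")
    .load()
)
df.show()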

PS: If PySpark runs in the Jupyter container, you don't need to pass the -k flag, because -k also starts Spark on the DSE node itself, and that doesn't work well with only 4 GB of RAM. Likewise, if you don't need DSE Graph, remove the -g switch.
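Following that advice, the docker run command from the question shrinks to something like this sketch. The volume mounts are omitted for brevity, and publishing port 9042 is my addition, needed only if you connect through the host's IP (the last option above):

docker run \
    -e DS_LICENSE=accept \
    --memory 4g \
    --name my-dse \
    -d \
    -p 9042:9042 \
    datastax/dse-server:6.8.4 \
    -s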
