
Building a Production-grade Data Science environment at home - Questions around orchestration

I hope you can help me here. I am working on creating a small Data Science environment at home. I am having trouble understanding how to build the orchestration layer properly (I am also not convinced that the other components of the architecture I have selected are the most appropriate). If anyone has experience with any of these components and can give me some recommendations, I would greatly appreciate it.

I am using old computers and laptops to create the environment (cheaper than using the cloud), some of them with NVIDIA GPUs. So here is the architecture I have in mind.

  • For the underlying infrastructure, I am using Docker with Docker Swarm.
  • I have 3 tiers of storage: an SSD for hot data (on one of the servers), the regular drives of the different PCs joined through GlusterFS for the database data, and an NFS volume from my NAS for archival.
  • I already have a container with a GPU version of JupyterLab (potentially for using TensorFlow or PyTorch) for development purposes.
  • Another container with GitLab for version control/CI.
  • Another container with Apache NiFi for real-time data ingestion. I am also thinking of using Kafka to better manage the streaming data asynchronously (the data comes from a websocket).
  • Apache Druid as the database for the data

So here comes my question: assume I develop an algorithm that requires training, and I need to orchestrate a re-training of the model from time to time. How do I perform the retraining automatically? I know I can use NiFi (or alternatively Apache Airflow), but the re-training needs to be executed in a GPU Docker container. Can I simply prepare a Docker container with a GPU and Python and somehow tell NiFi (or Airflow) that it needs to execute the operations on that container? (I don't even know if that is possible.)

Another question is about performing operations in real time as the data lands. Will using Kafka and Druid suffice, or should I think about using Spark Streaming? I am looking into executing transformations on the data, passing the data through the models, etc., and potentially sending POST requests to an API depending on the results.

I am used to working only in a development environment (Jupyter), so when it comes to putting things into production, I have lots of gaps in how things work. Hence the purpose of this project is to learn how the different components work together and to practice different technologies (NiFi, Kafka, Druid, etc.).

I hope you can help me.

Thanks in advance.

My first thought is that there are awesome free classes on Coursera that use the free tiers of GCP/AWS/Azure. If you're trying to become familiar with the data engineering required to support data science, that's where I'd start. They'll walk you through the stack and the ops.

If you're determined to build it yourself, then, as you mentioned, you're going to want to pre-process the data before sending it to Druid.

https://www.decodable.co/blog/top-6-patterns-for-ingesting-data-into-druid

For stream processing, you could use decodable.co's free tier (500 GB/day). Do you have a datagen up and running, or are you pulling from a Twitter firehose, or something else?
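If you end up rolling the pre-processing yourself instead, a common pattern is a small consume-transform-produce loop: read the raw websocket events from one Kafka topic, clean them up, and write them to a second topic that Druid then ingests through its Kafka indexing service. A bare-bones sketch with kafka-python (the topic names, field names, and broker address are all placeholders, not anything from your setup):

```python
# preprocess.py -- minimal consume-transform-produce sketch, not production code.
# Assumes kafka-python is installed; topic names and fields are placeholders.
import json

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "raw-events",                              # raw websocket data landed by NiFi
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    event = message.value
    # Example transformation: keep only the fields Druid needs and make sure
    # there is a timestamp column for Druid's ingestion spec.
    cleaned = {
        "timestamp": event.get("ts"),
        "symbol": event.get("symbol"),
        "price": float(event.get("price", 0)),
    }
    producer.send("clean-events", cleaned)     # Druid's Kafka ingestion reads this topic
```

Druid can then consume the cleaned topic directly; Spark Streaming only really earns its extra moving parts if the transformations get heavy (joins, windowed aggregations, stateful logic).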

To run a task in a specific container, the easiest way is to use Apache Airflow's DockerOperator. Typically you provide a CLI to start the training, and call that CLI inside the container through Airflow. Ref: https://airflow.apache.org/docs/apache-airflow-providers-docker/stable/_api/airflow/providers/docker/operators/docker/index.html
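For the GPU part specifically, here is a minimal DAG sketch, assuming a hypothetical trainer:latest image you build yourself (with Python and your training code inside), a train.py CLI as its entrypoint, and the NVIDIA container toolkit installed on the host; the names and schedule are illustrative, not a drop-in solution:

```python
# retrain_dag.py -- minimal sketch, assuming apache-airflow-providers-docker
# is installed, "trainer:latest" is a hypothetical image you build yourself,
# and the host has the NVIDIA container toolkit so GPUs can be passed through.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator
from docker.types import DeviceRequest

with DAG(
    dag_id="model_retraining",
    start_date=datetime(2024, 1, 1),
    schedule="@weekly",   # retrain once a week; adjust to whatever cadence you need
    catchup=False,
) as dag:
    retrain = DockerOperator(
        task_id="retrain_model",
        image="trainer:latest",                    # your GPU training image
        command="python train.py --epochs 10",     # the CLI you expose in the image
        docker_url="unix://var/run/docker.sock",   # talk to the local Docker daemon
        device_requests=[DeviceRequest(count=-1, capabilities=[["gpu"]])],  # request all GPUs
        # mounts=[...]  # add docker.types.Mount entries for your training data volumes
    )
```

The nice part of this pattern is that the training environment lives entirely in the image; the only contract between Airflow and the model code is the CLI.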
