Azure Databricks Python Job

I need to parse a large number of small unstructured files in near real time inside Azure and load the parsed data into a SQL database. I chose Python (I don't think a Spark cluster or any big data tooling would suit, given the volume and size of the source files), and the parsing logic has already been written. I would like to schedule this Python script using Azure PaaS, and I am considering the following options:

  1. Azure Data Factory
  2. Azure Databricks
  3. Both 1+2

What are the implications of running a Python notebook activity from Azure Data Factory that points to Azure Databricks? Would I be able to fully leverage the potential of the cluster (driver and workers)?

Also, do you think the script has to be converted to PySpark to meet my use case if it runs in Azure Databricks? My only hesitation is that the files are only a few KB each and they are unstructured.

If the script is pure Python, it will run only on the driver node of the Databricks cluster, which makes it very expensive (and slow, because of cluster startup times).

You could rewrite it in PySpark, but if the data volumes are as low as you say, this will still be expensive and slow. The smallest cluster will consume two VMs, each with 4 cores.
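For illustration, here is a rough sketch of what such a PySpark conversion might look like. `parse_text` is a hypothetical stand-in for your existing parsing logic, and the input path, JDBC URL, table name and connection properties are placeholders I made up, not values from your setup. Only the function passed to `map` runs on the workers; everything else still runs on the driver.

```python
from typing import Tuple


def parse_text(path: str, content: str) -> Tuple[str, int, str]:
    """Hypothetical parser: returns (file name, line count, first line)."""
    lines = content.splitlines()
    first_line = lines[0].strip() if lines else ""
    return (path.rsplit("/", 1)[-1], len(lines), first_line)


def run_with_spark(input_path: str, jdbc_url: str, table: str, properties: dict) -> None:
    """Apply parse_text on the workers and append the results to a SQL table."""
    # Imported lazily so the parser above stays usable without Spark installed.
    from pyspark.sql import SparkSession

    # On Databricks this returns the session that already exists.
    spark = SparkSession.builder.getOrCreate()

    # wholeTextFiles yields one (path, content) pair per file,
    # which fits a workload of many small files.
    pairs = spark.sparkContext.wholeTextFiles(input_path)

    # This lambda is the part that is actually distributed to the workers.
    parsed = pairs.map(lambda kv: parse_text(kv[0], kv[1]))

    # Column names are assumptions chosen for this sketch.
    df = spark.createDataFrame(parsed, ["file_name", "line_count", "first_line"])
    df.write.jdbc(url=jdbc_url, table=table, mode="append", properties=properties)
```

With files of a few KB each, the per-file work is tiny, so the cluster startup and scheduling overhead is likely to outweigh any gain from distributing it.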

I would look at using Azure Functions instead. Python is now an option: https://docs.microsoft.com/en-us/azure/python/tutorial-vs-code-serverless-python-01

Azure Functions also integrates well with Azure Data Factory, so your workflow would still work.
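As a minimal sketch of the Functions approach, assuming a blob-triggered function: `parse_bytes`, `get_connection` and the `parsed_files` table are hypothetical names invented for this example, and `sqlite3` is only a stand-in for your real SQL database so that the sketch runs anywhere. In a deployed function the `myblob` argument would be an `azure.functions.InputStream`, which exposes `.name` and `.read()`; the code below relies only on those two members.

```python
import sqlite3


def parse_bytes(name: str, data: bytes) -> tuple:
    """Hypothetical parser standing in for the existing parsing logic."""
    text = data.decode("utf-8", errors="replace")
    return (name, len(data), len(text.splitlines()))


def get_connection():
    """Placeholder: a real deployment would connect to the target SQL database."""
    return sqlite3.connect(":memory:")


def handle_blob(name: str, data: bytes, conn) -> None:
    """Parse one file and insert the resulting row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS parsed_files "
        "(name TEXT, size_bytes INTEGER, line_count INTEGER)"
    )
    conn.execute("INSERT INTO parsed_files VALUES (?, ?, ?)", parse_bytes(name, data))
    conn.commit()


def main(myblob) -> None:
    """Entry point invoked once per new blob (one small file per invocation)."""
    conn = get_connection()
    try:
        handle_blob(myblob.name, myblob.read(), conn)
    finally:
        conn.close()
```

Keeping the parsing and loading in `handle_blob` means the existing script can be tested locally without the Functions runtime, and `main` stays a thin wrapper around it.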
