Organize Python scripts for a data science project

I have a folder with many subfolders, and each subfolder holds several Python scripts that do a variety of tasks for the project: data analysis, calling other scripts, automating chores, and so on. Some of these scripts are related to each other and some are standalone, but they are all part of the same 'project'.

I come from the Java world, where I am used to packaging everything related to a project in a .jar file. Is there something similar I can do to organize these wild Python scripts, ideally to the point of giving them all a common entry point?

Yes, you can package the project and install it. I like using Poetry for dependency management, virtualenvs, and packaging. It can install your project into either a virtual or a global environment, or build a *.whl for your project that makes it pip-installable elsewhere (in a Docker container, on a cloud resource, etc.). It's sort of like Maven for Python.
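To illustrate the "common entry point" part of the question, here is a minimal sketch using argparse subcommands from the standard library. The command names (analyse, train), the handler functions, and the program name myproject are hypothetical and stand in for whatever your scripts actually do.

```python
import argparse
from typing import List, Optional


def run_analysis(args: argparse.Namespace) -> str:
    # Placeholder for a real analysis script.
    return f"analysing {args.dataset}"


def run_training(args: argparse.Namespace) -> str:
    # Placeholder for a real training script.
    return f"training for {args.epochs} epochs"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="myproject")
    subparsers = parser.add_subparsers(dest="command", required=True)

    analyse = subparsers.add_parser("analyse", help="run the analysis task")
    analyse.add_argument("dataset")
    analyse.set_defaults(handler=run_analysis)

    train = subparsers.add_parser("train", help="run the training task")
    train.add_argument("--epochs", type=int, default=1)
    train.set_defaults(handler=run_training)
    return parser


def dispatch(argv: Optional[List[str]] = None) -> str:
    # Parse the command line and call the handler of the chosen subcommand.
    args = build_parser().parse_args(argv)
    return args.handler(args)


def main(argv: Optional[List[str]] = None) -> int:
    # Return an exit code so the function is safe to use as a console script.
    print(dispatch(argv))
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```

If the project is packaged with Poetry, a function like main can be exposed as an installed command through the [tool.poetry.scripts] table in pyproject.toml, for example myproject = "myproject.cli:main" (the module path here is an assumption about your layout).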

As for the "wild scripts", there's no reason why your Python code has to be disorganized. All of your usual hygiene around clean code, modular OO design patterns, good encapsulation, dependency injection, etc. is still encouraged; Python just won't force those guard rails onto you the way Java does, so it's very much "bring your own good habits". I often organize my Python projects into modular, Java-esque subpackages where my domain models and other reusable components are defined. These can then be imported into any script I write in my scripts folder, which also makes the scripts themselves quite a bit more maintainable and orderly.
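As a minimal sketch of those habits in Python, the example below defines a domain model, a repository interface, and a function that receives the repository as an argument (dependency injection). All names here (Measurement, MeasurementRepository, mean_value) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class Measurement:
    """Hypothetical domain model."""
    sensor_id: str
    value: float


class MeasurementRepository(Protocol):
    """Interface that any storage backend is expected to satisfy."""

    def load_all(self) -> List[Measurement]:
        ...


class InMemoryMeasurementRepository:
    """Simple implementation that is handy for tests and debugging."""

    def __init__(self, rows: List[Measurement]) -> None:
        self._rows = list(rows)

    def load_all(self) -> List[Measurement]:
        return list(self._rows)


def mean_value(repository: MeasurementRepository) -> float:
    # The repository is injected, so a script can pass a real backend
    # while a test passes the in-memory one.
    rows = repository.load_all()
    if not rows:
        raise ValueError("no measurements to average")
    return sum(row.value for row in rows) / len(rows)
```

A script would then only wire things together, for example by building a repository and calling mean_value(repository), while the reusable logic stays in the subpackages.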

A rough example structure could be (depending on the type of project):

project-root
   - domain
     - domain_model_a
     - domain_model_b
   - training
     - machine_learning_model
   - storage
     - repository
        - domain_model_a_repository
        - domain_model_b_repository
     - service
        - elastic_search_service
   - script
        - script_that_does_X
        - ml_training_script
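To show how a script in such a layout can import a domain model, the sketch below builds a tiny version of the tree in a temporary directory and imports from it. It assumes each folder is a regular package with an __init__.py and that everything sits under a top-level package, here hypothetically called my_project; the file contents are made up for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package name and file contents, chosen only for illustration.
PACKAGE = "my_project"
FILES = {
    "__init__.py": "",
    "domain/__init__.py": "",
    "domain/domain_model_a.py": (
        "from dataclasses import dataclass\n"
        "\n"
        "@dataclass\n"
        "class DomainModelA:\n"
        "    name: str\n"
    ),
    "script/__init__.py": "",
    "script/script_that_does_x.py": (
        "from my_project.domain.domain_model_a import DomainModelA\n"
        "\n"
        "def main():\n"
        "    return DomainModelA(name='example')\n"
    ),
}


def scaffold(root: Path) -> None:
    # Write every file of the miniature project below root/PACKAGE.
    for relative_path, content in FILES.items():
        target = root / PACKAGE / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)


def run_demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        scaffold(root)
        sys.path.insert(0, str(root))
        try:
            importlib.invalidate_caches()
            module = importlib.import_module(f"{PACKAGE}.script.script_that_does_x")
            return module.main()
        finally:
            # Undo the path and module-cache changes made for the demo.
            sys.path.remove(str(root))
            for name in list(sys.modules):
                if name == PACKAGE or name.startswith(PACKAGE + "."):
                    del sys.modules[name]


if __name__ == "__main__":
    print(run_demo())
```

In a real project you would not touch sys.path by hand; installing the package (for instance with Poetry) makes the same absolute imports work from any script.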

Lastly, it can be nice for your dev workflow to add an if __name__ == "__main__": block at the bottom of a component if you want to be able to invoke, say, a repository or service directly as a script for debugging purposes. Doing so can also inform what kinds of integration tests you want to write.
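A minimal sketch of that pattern follows. The repository class is hypothetical and keeps its records in a dict rather than a real database, so the module can be both imported by other code and run directly for a quick check.

```python
from typing import Dict, Optional


class DomainModelARepository:
    """Hypothetical repository backed by a dict instead of a real store."""

    def __init__(self) -> None:
        self._records: Dict[str, dict] = {}

    def save(self, key: str, record: dict) -> None:
        self._records[key] = record

    def get(self, key: str) -> Optional[dict]:
        return self._records.get(key)


def smoke_check() -> Optional[dict]:
    # Exercise the component end to end, as you might while debugging.
    repository = DomainModelARepository()
    repository.save("a-1", {"name": "example"})
    return repository.get("a-1")


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    print(smoke_check())
```

A check like smoke_check is often a good starting point for an integration test once it is pointed at a real backend.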

You may want to have a look at this cookiecutter for data science projects.
