
How to organize a Python project with pickle files?

I come from a Java background and am completely new to Python.

I now have a Python project that consists of a few Python scripts and pickle files stored in Git. The pickle files are serialized sklearn models.

I wonder how to organize this project. I think we should not store the pickle files in Git; we should probably store them somewhere as binary dependencies.

Does that make sense? What is a common way to store the binary dependencies of a Python project?

Git is just fine with binary data. For example, many projects store images in Git repos.

I guess the rule of thumb is to decide whether your binary files are source material, an external dependency, or the output of an intermediate build step. Of course, there are no strict rules, so just decide how you feel about them. Here are my suggestions:

  1. If they're (reproducibly) generated from something, .gitignore the binaries and have scripts that build the necessary data. The scripts could live in the same repo or in a separate one, depending on where it feels best.

  2. The same logic applies if they're obtained from some external source, e.g. an external download. Usually, we don't store dependencies in the repository; we only keep references to them. E.g. we don't keep virtualenvs but only a requirements.txt file. A rough Java-world analogy is not keeping .jars but only a pom.xml or the dependencies section of build.gradle.

  3. If they can be considered source material, e.g. if you manipulate them using Python as an editor, don't worry about the files' binary nature and just keep them in your repository.

  4. If they aren't really source material, but their generation process is really complicated or takes a very long time, and the files aren't meant to be updated on a regular basis, I think it won't be terribly wrong to have them in the repo. Leaving a note (README.txt or something) about how the files were produced would be a good idea, of course.
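Suggestion 2 (keeping only a reference to an externally stored binary) could look roughly like the following in Python. This is a minimal sketch rather than an established convention: the manifest entry fields (`name`, `url`, `sha256`), the `models_cache` directory and the function names are all assumptions made for illustration. The idea is that the small manifest entry is committed to Git, while the downloaded pickle lands in a directory listed in .gitignore.

```python
import hashlib
import pickle
import shutil
import urllib.request
from pathlib import Path

# Hypothetical cache location; it would be listed in .gitignore.
CACHE_DIR = Path("models_cache")


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_model(entry: dict, cache_dir: Path = CACHE_DIR) -> Path:
    """Download the file described by `entry` unless a verified copy is cached.

    `entry` is an assumed manifest record with the keys
    "name", "url" and "sha256".
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / entry["name"]
    # Reuse the cached file only if its checksum still matches the manifest.
    if target.exists() and sha256_of(target) == entry["sha256"]:
        return target
    # Download to a temporary name so a failed download never looks valid.
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(entry["url"]) as response, tmp.open("wb") as out:
        shutil.copyfileobj(response, out)
    actual = sha256_of(tmp)
    if actual != entry["sha256"]:
        tmp.unlink()
        raise ValueError(f"checksum mismatch for {entry['name']}: got {actual}")
    tmp.replace(target)
    return target


def load_model(entry: dict, cache_dir: Path = CACHE_DIR):
    """Fetch the file if needed, then unpickle it."""
    # pickle.load can execute arbitrary code: only load files you trust.
    with fetch_model(entry, cache_dir).open("rb") as fh:
        return pickle.load(fh)
```

The checksum plays a role similar to a pinned version in requirements.txt: the repository records exactly which binary is expected without containing it.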

Oh, and if the files are large (like hundreds of megabytes or more), consider taking a look at git-lfs.
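To see whether any files in a project are in that size range, a small helper like the one below could list them. It is only a sketch: the function name, the `*.pkl` pattern and the 100 MiB default threshold are assumptions chosen for illustration, not a limit defined by Git or git-lfs.

```python
from pathlib import Path


def find_large_files(root, threshold_bytes=100 * 1024 * 1024, pattern="*.pkl"):
    """Return (path, size_in_bytes) pairs for files under `root` that match
    `pattern` and are at least `threshold_bytes` large, largest first."""
    hits = []
    for path in Path(root).rglob(pattern):
        if not path.is_file():
            continue
        size = path.stat().st_size
        if size >= threshold_bytes:
            hits.append((path, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Report candidates in the current directory tree.
    for path, size in find_large_files("."):
        print(f"{size / 1024 / 1024:8.1f} MiB  {path}")
```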
