
How to fine-tune GitHub Copilot?

We can fine-tune language models like BERT and GPT-3.

Can I fine-tune the GitHub Copilot model?

I have already looked into the examples from https://copilot.github.com/ but can't find the details.

I would really appreciate it if someone has fine-tuned GitHub Copilot.

There does not seem to be a client-facing feature allowing you to fine-tune Copilot directly.

Here are two illustrations of why this feature is, for now (Q2 2022), missing.

The Copilot feature page initially included this:

How will GitHub Copilot get better over time?

GitHub Copilot doesn't actually test the code it suggests, so the code may not even compile or run. GitHub Copilot can only hold a very limited context, so even single source files longer than a few hundred lines are clipped and only the immediately preceding context is used. And GitHub Copilot may suggest old or deprecated uses of libraries and languages. You can use the code anywhere, but you do so at your own risk.

As Tomek Korbak explains on Twitter:

Actually, Copilot's completions will always be optimised for human's liking, not necessarily compiler's liking.

That's because the language-model training objective (predicting the next token in text) is great at capturing short-term dependencies (which explains the human feel of generated snippets).

But it struggles to capture long-term, global, semantic properties of generated sequences, such as compilability. And there's no easy way of including compilability as a signal for their training.

The standard way -- fine-tuning language models using RL with compilability as a reward -- notoriously leads to catastrophic forgetting: less diverse and less accurate completions.

Tomek references "Energy-Based Models for Code Generation under Compilability Constraints (pdf)":

https://pbs.twimg.com/media/E5NHqGjXIAYRtwa?format=png&name=small

Our solution (KL-DPG) boosts the compilability rate of generated sequences from 55% to 70%.
RL fine-tuning can do better, but at the cost of catastrophic forgetting.
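The intuition behind this trade-off can be sketched numerically: the usual way to keep an RL-fine-tuned model from forgetting is to penalise it for drifting away from the original model's distribution. A minimal sketch with toy per-token probabilities standing in for a real model (the numbers and the `beta` weight are illustrative assumptions, not values from the paper):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalised_reward(compile_reward, p_tuned, p_orig, beta=0.1):
    """Task reward minus a KL penalty that anchors the tuned model to the
    original one, discouraging the drift behind catastrophic forgetting."""
    return compile_reward - beta * kl_divergence(p_tuned, p_orig)

# Toy next-token distributions over a 3-token vocabulary.
original = [0.5, 0.3, 0.2]
drifted = [0.9, 0.05, 0.05]  # collapsed onto one token: less diverse output

# Staying close to the original keeps the full reward;
# drifting far from it eats into the reward.
print(penalised_reward(1.0, original, original))
print(penalised_reward(1.0, drifted, original))
```

A pure RL objective is the `beta=0` case: nothing stops the tuned distribution from collapsing onto whatever compiles, which is exactly the loss of diversity described above.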

Overall, energy-based models (EBMs) turn out to be great at expressing weird, sequence-level constraints that would be super hard to express as normalised priors for autoregressive language models.

EBMs provide a way of injecting our structured, symbolic knowledge into large language models without breaking them down or sacrificing their uncanny abilities.
The space of further applications in controllable generation is huge.

So, not so easy.

Tanishq Mathew Abraham explains in "Coding with GitHub Copilot":

I wonder if the GitHub team might also develop a way of fine-tuning GitHub Copilot to specific use-cases.

For example, there may be specific GitHub Copilot models for fastai, JAX, etc. They would be fine-tuned on the source code of these libraries and on codebases that use these libraries.

But making sure that the tool does not provide outdated suggestions would still be a challenge.
I don't think it would be possible to provide suggestions for a brand-new library that does not have enough codebases using it to train on.

Additionally, for situations like fastai where there are older APIs and newer APIs, when fine-tuning a model, the codebases using the older APIs would have to be filtered out.

The OpenAI API offers the "Davinci Codex" machine-learning model with a pay-per-hit subscription, similar to the non-coding version of the Davinci model.

OpenAI should enable the fine-tuning option for Davinci Codex as well. Once they do, you will be able to use it via API calls.

Once that prerequisite is met, I think you could link the OpenAI API to your local installation of GitHub Copilot via some code changes; in theory, that should be possible.

The first step is probably to have a fork of the Copilot VSCode extension that calls the OpenAI Codex API (or an entirely custom extension which inserts text in your code).
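In essence, such a fork would replace Copilot's backend call with a plain HTTP request to OpenAI's completions endpoint. A minimal sketch of assembling that request body in Python; the model name `code-davinci-002` was the Codex model OpenAI documented at the time, but treat the exact name and parameter values as assumptions:

```python
import json

def build_codex_request(prompt, model="code-davinci-002", max_tokens=64):
    """Assemble the JSON body an extension would POST to the OpenAI
    completions endpoint. After fine-tuning, `model` would instead be
    the ID of your fine-tuned model."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.2,  # low temperature: more deterministic code
        "stop": ["\n\n"],    # stop at a blank line, i.e. end of snippet
    })

# The prompt is simply the code before the cursor, as the extension sees it.
body = build_codex_request("def fibonacci(n):")
print(body)
```

The extension would send this body (with an `Authorization: Bearer <API key>` header) and insert the returned completion text at the cursor.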

Then you would point it to your fine-tuned version of the model. To learn about fine-tuning OpenAI models, you should look at their documentation:

Note that they also have an openai CLI that allows you to do most of the data-loading and fine-tuning tasks.
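For illustration, the fine-tune endpoint expects training data as JSONL: one JSON object per line with `prompt` and `completion` keys. A minimal sketch of preparing such a file; the example pairs and file name are made up:

```python
import json

# Hypothetical prompt/completion pairs; for code, the prompt would be the
# context before the cursor and the completion the code you want generated.
examples = [
    {"prompt": "# add two numbers\ndef add(a, b):", "completion": " return a + b"},
    {"prompt": "# subtract b from a\ndef sub(a, b):", "completion": " return a - b"},
]

# The endpoint expects "JSONL": one JSON object per line.
with open("training_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# The openai CLI (as documented at the time) could then validate the file
# and launch the fine-tuning job, e.g.:
#   openai tools fine_tunes.prepare_data -f training_data.jsonl
#   openai api fine_tunes.create -t training_data.jsonl -m davinci
```

The CLI's prepare-data step checks the format and suggests fixes (such as adding a consistent stop sequence) before you start the job.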

Unfortunately, at the moment you can only fine-tune the non-coding versions of OpenAI models; hopefully they will make Codex available for fine-tuning soon.

No, not at all. GitHub Copilot's model is not stored on the client's system, and no access to the model is provided. Since they have now started charging for the service, it is all the more obvious that the project isn't, and won't be, open-sourced.
