
Source code language analyser

I want to detect the programming language of a piece of source code using Ruby.

For example: (PHP)

$a = array("1","2","3");
print_r($a); 

(Ruby)

def index
end

etc.

What gem can do this?

Linguist might do that for you (it's what GitHub uses to detect the primary languages in a project).

If you're looking to build your own, that would be a good place to start. Here are a few more notes on what else you might have to do to make one.

File extensions are a good cheat. For example:

  • .rb - almost always Ruby
  • .cpp - almost always C++
  • .h - could be C or C++

...etc., then read the code line by line. There are usually common keywords, or a typical placement of those keywords within the code, that will tip you off pretty quickly as to what language it's written in. A review of several "getting started" tutorial web sites for the languages you want to support should give you a good summary of these things, without needing to actually learn the languages themselves. All you really need is a few things unique to each language that you can pick up on and that mark a file as definitively one language or another.
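The two heuristics above (extension first, then telltale keywords) could be combined roughly as follows. This is only a minimal sketch: `guess_language`, `EXTENSIONS`, and `HINTS` are hypothetical names, and the patterns are illustrative assumptions, not a vetted rule set.

```ruby
# Hypothetical lookup table: extension => candidate languages.
EXTENSIONS = {
  ".rb"  => ["Ruby"],
  ".cpp" => ["C++"],
  ".h"   => ["C", "C++"], # ambiguous, so the content has to decide
  ".php" => ["PHP"]
}.freeze

# Illustrative (far from exhaustive) telltale patterns per language.
HINTS = {
  "Ruby" => [/^\s*def\s+\w+/, /^\s*end\s*$/, /\bputs\b/],
  "PHP"  => [/<\?php/, /\$\w+\s*=/, /\bprint_r\s*\(/, /\barray\s*\(/],
  "C"    => [/#include\s*<\w+\.h>/, /\bprintf\s*\(/, /\bmalloc\s*\(/],
  "C++"  => [/#include\s*<\w+>/, /\bstd::/, /\bclass\s+\w+/, /\btemplate\s*</]
}.freeze

# Returns a language name, or nil when nothing matches.
def guess_language(source, filename = nil)
  candidates = filename && EXTENSIONS[File.extname(filename).downcase]
  # An unambiguous extension settles it without reading the code.
  return candidates.first if candidates && candidates.size == 1

  # Otherwise count pattern hits line by line for each candidate language.
  pool = candidates || HINTS.keys
  scores = pool.map do |lang|
    hits = source.each_line.sum { |line| HINTS[lang].count { |re| line.match?(re) } }
    [lang, hits]
  end
  best, score = scores.max_by { |_, hits| hits }
  score.zero? ? nil : best
end

guess_language("def index\nend\n")         # => "Ruby"
guess_language("anything", "controller.rb") # => "Ruby"
```

Ties go to whichever language is listed first, which is one reason a statistical approach tends to scale better than hand-written rules.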

You could also use a Bayesian learning filter (there is a module called Classifier in Ruby that appears to do this) to train a more flexible learning engine that identifies the language of a piece of code on its own. Since programming languages are highly structured text, it shouldn't take very long for your learning software to get extremely good at identifying the language. If you wanted to go totally crazy, you could even train it to identify not only the language, but also the minimum version of the language that the code can be compiled against. For example, Java added generics in Java 5. If you see generics used in the code, then you know that the source was written for at least that version of Java, and so on.
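To make the Bayesian idea concrete, here is a from-scratch naive Bayes sketch in plain Ruby. It does not use the Classifier gem's API; `TinyLanguageClassifier` and its methods are hypothetical names, and the tokenizer (identifiers plus single punctuation characters) is an assumption chosen for brevity.

```ruby
# Minimal multinomial naive Bayes over source-code tokens.
class TinyLanguageClassifier
  def initialize
    @token_counts = Hash.new { |h, lang| h[lang] = Hash.new(0) } # lang => token => count
    @doc_counts   = Hash.new(0)                                  # lang => number of samples
    @vocabulary   = {}                                           # every token seen in training
  end

  # Identifiers/keywords and individual punctuation marks; digits are ignored.
  def tokenize(source)
    source.scan(/[A-Za-z_]\w*|[^\sA-Za-z0-9_]/)
  end

  def train(language, source)
    @doc_counts[language] += 1
    tokenize(source).each do |token|
      @token_counts[language][token] += 1
      @vocabulary[token] = true
    end
  end

  # Picks the language with the highest log-probability (add-one smoothing).
  # Returns nil if nothing has been trained yet.
  def classify(source)
    tokens     = tokenize(source)
    total_docs = @doc_counts.values.sum.to_f
    vocab_size = @vocabulary.size
    @doc_counts.keys.max_by do |lang|
      counts       = @token_counts[lang]
      total_tokens = counts.values.sum
      prior        = Math.log(@doc_counts[lang] / total_docs)
      tokens.sum(prior) do |token|
        Math.log((counts[token] + 1.0) / (total_tokens + vocab_size))
      end
    end
  end
end

classifier = TinyLanguageClassifier.new
classifier.train("PHP",  '$a = array("1","2","3"); print_r($a);')
classifier.train("PHP",  '<?php function greet($name) { echo "Hello " . $name; } ?>')
classifier.train("Ruby", "def index\nend")
classifier.train("Ruby", "class Greeter\n  def greet(name)\n    puts name\n  end\nend")

classifier.classify("def show\n  puts title\nend") # => "Ruby"
```

Four tiny samples are only enough for a demonstration; a usable classifier would need a much larger corpus per language.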

A little more complex, but not much, are questions like what to do with .erb files. Do you call those "Embedded Ruby", do you call them "Ruby", do you count the lines of HTML vs. Ruby vs. JavaScript and name the file after the most numerous language, or do you just tag the file with ALL the languages found? I suppose that's really more of a design decision.
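Either choice could be built on a per-line tally like the one below. This is a rough sketch under simplifying assumptions: a line counts as Ruby if it contains an ERB tag opener, lines inside a `<script>` element count as JavaScript, everything else counts as HTML, and multi-line ERB tags are not handled. All helper names are hypothetical.

```ruby
# Attribute each non-blank line of an .erb template to one language.
def erb_breakdown(template)
  counts = Hash.new(0)
  in_script = false # true while between <script> and </script>
  template.each_line do |line|
    stripped = line.strip
    next if stripped.empty?

    if stripped.include?("<%")
      counts["Ruby"] += 1
    elsif stripped.match?(/<script\b/i)
      # Stay in script mode unless the element closes on the same line.
      in_script = !stripped.match?(%r{</script>}i)
      counts["HTML"] += 1
    elsif stripped.match?(%r{</script>}i)
      in_script = false
      counts["HTML"] += 1
    elsif in_script
      counts["JavaScript"] += 1
    else
      counts["HTML"] += 1
    end
  end
  counts
end

# Option 1: name the file after the most numerous language.
def dominant_language(counts)
  counts.max_by { |_, n| n }&.first
end

# Option 2: tag the file with every language found.
def all_languages(counts)
  counts.keys.sort
end

template = <<~'ERB'
  <h1>Posts</h1>
  <%= link_to "New", new_post_path %>
  <% @posts.each do |post| %>
    <p><%= post.title %></p>
  <% end %>
  <script>
    console.log("loaded");
  </script>
ERB

counts = erb_breakdown(template)
dominant_language(counts) # => "Ruby"
all_languages(counts)     # => ["HTML", "JavaScript", "Ruby"]
```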

Source classifier is a gem that should work for what you want to do. It identifies the programming language using a Bayesian classifier trained on a corpus generated from the Computer Language Benchmarks Game (http://shootout.alioth.debian.org/). It is written in Ruby and available as a gem. Out of the box, SourceClassifier recognises C, Java, JavaScript, Perl, Python and Ruby. A nice advantage of using a Bayesian classifier to identify the source code is that even false matches will still give some usable highlighting. To train the classifier to identify new languages, download the sources from GitHub.

The only thing I can think of is https://github.com/github/linguist. A wonderful gem, but I don't think it's exactly what you need.
