
Source code language analyser

I want to detect the programming language of a piece of source code using Ruby.

For example: (PHP)

$a = array("1","2","3");
print_r($a); 

(Ruby)

def index
end

etc.

What gem can do this?

Linguist might do that for you (it's what GitHub uses to detect the primary languages in a project).

If you're looking to build your own, that would be a good place to start. Here are a few more notes on what else you might have to do in order to make one.

File extensions are a good cheat. For example:

  • .rb - almost always Ruby
  • .cpp - almost always C++
  • .h - could be C/C++

...etc. Then read the code line by line. Most languages have common keywords, or a characteristic placement of those keywords within the code, that will tip you off pretty quickly as to what language it's written in. A review of several "getting started" tutorial web sites for the languages you want to support should give you a good summary of these things without needing to actually learn the languages themselves. All you really need is a few things unique to each language that mark a file as definitively one language or another.
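As a rough illustration of the two steps above (extension first, then a line-by-line keyword scan), here is a minimal sketch in plain Ruby. The `EXTENSIONS` and `SIGNATURES` tables and the `detect_language` method are hypothetical names invented for this example, and the patterns are only a handful of guesses at telltale constructs, not a complete or authoritative list.

```ruby
# Hypothetical lookup table: extension => candidate languages.
EXTENSIONS = {
  ".rb"  => ["Ruby"],
  ".cpp" => ["C++"],
  ".h"   => ["C", "C++"], # ambiguous, so the content has to decide
  ".php" => ["PHP"]
}.freeze

# Hypothetical telltale patterns per language (deliberately tiny).
SIGNATURES = {
  "PHP"  => [/<\?php/, /\$\w+\s*=/, /\bprint_r\s*\(/, /\barray\s*\(/],
  "Ruby" => [/^\s*def\s+\w+/, /^\s*end\s*$/, /\bputs\b/, /\brequire\s+['"]/],
  "C"    => [/#include\s*<\w+\.h>/, /\bprintf\s*\(/, /\bmalloc\s*\(/],
  "C++"  => [/#include\s*<\w+>/, /\bstd::/, /\bclass\s+\w+/, /\btemplate\s*</]
}.freeze

# Returns a language name, or nil when nothing matched.
def detect_language(source, filename = nil)
  candidates = SIGNATURES.keys
  if filename
    candidates = EXTENSIONS.fetch(File.extname(filename).downcase, candidates)
  end
  # An unambiguous extension settles it without reading the code.
  return candidates.first if candidates.size == 1

  # Otherwise count pattern hits line by line and take the best score.
  scores = Hash.new(0)
  source.each_line do |line|
    candidates.each do |lang|
      scores[lang] += SIGNATURES[lang].count { |re| line.match?(re) }
    end
  end
  best, hits = scores.max_by { |_, n| n }
  hits.to_i.zero? ? nil : best
end
```

Called as `detect_language(code)` it relies on content alone; passing a filename as the second argument lets the extension narrow the candidates first. Real code will need far more patterns than this, and ties are simply broken by table order.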

You could also use a Bayesian learning filter (there is a Ruby gem called Classifier that appears to do this) to train a more flexible engine that identifies code by language on its own. Since programming languages are highly structured text, it shouldn't take very long for your learning software to get extremely good at identifying the language. If you wanted to go totally crazy, you could even train it to identify not only the language, but the minimum version of the language that the code can be compiled against. For example, Java added generics in Java 5. If you see generics in the code, then you know that the source was written for Java 5 or later, etc.
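To make the Bayesian idea concrete, here is a small hand-rolled naive Bayes classifier in plain Ruby. It is a sketch under stated assumptions, not the API of the Classifier gem: the `LanguageClassifier` class, its `train` and `classify` methods, and the tokenizer are all invented for this example, and the training snippets are far too few for real use.

```ruby
# Hand-rolled multinomial naive Bayes with add-one smoothing.
# Hypothetical class for illustration; not the Classifier gem's API.
class LanguageClassifier
  def initialize
    @token_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # lang => token => count
    @doc_counts   = Hash.new(0)                            # lang => samples seen
    @vocabulary   = {}                                     # every token seen so far
  end

  def train(language, source)
    @doc_counts[language] += 1
    tokenize(source).each do |token|
      @token_counts[language][token] += 1
      @vocabulary[token] = true
    end
  end

  # Returns the language with the highest log-probability, or nil if untrained.
  def classify(source)
    return nil if @doc_counts.empty?

    tokens     = tokenize(source)
    total_docs = @doc_counts.values.sum.to_f
    @doc_counts.keys.max_by do |lang|
      counts    = @token_counts[lang]
      denom     = counts.values.sum + @vocabulary.size
      log_prior = Math.log(@doc_counts[lang] / total_docs)
      log_prior + tokens.sum { |t| Math.log((counts.fetch(t, 0) + 1.0) / denom) }
    end
  end

  private

  # Identifiers and single punctuation marks both carry signal in code.
  def tokenize(source)
    source.scan(/[A-Za-z_]\w*|[^\s\w]/)
  end
end

clf = LanguageClassifier.new
clf.train("PHP",  %q{$a = array("1","2","3"); print_r($a);})
clf.train("PHP",  %q{<?php echo $name; foreach ($items as $i) { echo $i; } ?>})
clf.train("Ruby", %q{def index; end})
clf.train("Ruby", %q{items.each do |i| puts i end})
puts clf.classify(%q{$b = array("x"); print_r($b);})
```

Treating punctuation such as `$`, `;` and `|` as tokens is a design choice: in code these symbols are often more distinctive than the identifiers around them.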

A little more complex, but not much, are mixed files such as .erb. Do you call those "Embedded Ruby", do you call them "Ruby", do you count the lines of HTML vs. Ruby vs. JavaScript and label the file with the most numerous language, or do you just tag the file with ALL the languages found? I suppose that's really more of a design decision.
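Whichever policy you pick, the first step is the same: split the template by language. The sketch below is one rough way to do that, assuming that anything inside `<% ... %>` is Ruby, anything inside `<script>...</script>` is JavaScript, and the rest is HTML. It weighs each language by non-whitespace characters rather than by lines, since a single line often mixes languages. The method names are hypothetical.

```ruby
# Rough per-language weight of an ERB template, in non-whitespace characters.
def erb_breakdown(template)
  counts = Hash.new(0)
  size   = ->(text) { text.gsub(/\s+/, "").length }

  # Pull out the embedded Ruby tags first, then the script blocks.
  rest = template.gsub(/<%.*?%>/m) do |tag|
    counts["Ruby"] += size.call(tag)
    ""
  end
  rest = rest.gsub(%r{<script\b.*?</script>}mi) do |js|
    counts["JavaScript"] += size.call(js)
    ""
  end
  counts["HTML"] += size.call(rest)
  counts.reject { |_, n| n.zero? }
end

# Policy 1: label the file with the most numerous language.
def dominant_language(template)
  erb_breakdown(template).max_by { |_, n| n }&.first
end

# Policy 2: tag the file with every language found.
def all_languages(template)
  erb_breakdown(template).keys.sort
end
```

The third option from the paragraph above, simply calling every .erb file "Embedded Ruby", needs no content analysis at all, only the extension.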

SourceClassifier is a gem that should work for what you want to do. It is written in Ruby and identifies the programming language using a Bayesian classifier trained on a corpus generated from the Computer Language Benchmarks Game (http://shootout.alioth.debian.org/). Out of the box, SourceClassifier recognises C, Java, JavaScript, Perl, Python and Ruby. A nice advantage of using a Bayesian classifier to identify source code is that even false matches will still give some usable highlighting. To train the classifier to identify new languages, download the sources from GitHub.

The only thing I can think of is https://github.com/github/linguist. A wonderful gem, but I don't think it's exactly what you need.
