
Apache Solr - How to index source code files

I want to write a program that searches source code files for specific patterns. In other words, the input is a piece of code, for example:

int fib (int i) {
  int pred, result, temp;

  pred = 1;
  result = 0;

  while (i > 0) {
    temp = pred + result;
    result = pred;
    pred = temp;
    i = i-1;
  }
  return(result);
}

The output should be the files that contain this piece of code or similar code.

In the open source world, code is often reused across projects; libraries in particular are frequently copied into them. To make bug fixing easier, I need to know which projects use a specific library or piece of code.

Therefore I want to try Apache Solr. I don't know whether it is a good idea (I would be happy about anything that could help me).

My plan is to index my source code files, so I need some tool to tokenize them, i.e. to give me all the names of functions, variables, etc. I can then use that output to feed the Solr index. But I am not sure: maybe Apache Solr already ships a tokenizer or DataImportHandler that does the trick?
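To illustrate the kind of identifier extraction described above, here is a minimal sketch in Python. It is regex-based rather than a real parser, the keyword set is a deliberately incomplete assumption, and the name `extract_identifiers` is made up for this example. It does not handle comments or string literals.

```python
import re

# Small, incomplete set of C keywords (an assumption for this sketch).
C_KEYWORDS = {"int", "while", "return", "if", "else", "for",
              "void", "char", "float", "double"}

# Matches C-style identifiers and keywords alike.
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def extract_identifiers(source):
    """Return distinct non-keyword identifiers in order of first appearance."""
    seen = []
    for name in IDENTIFIER.findall(source):
        if name not in C_KEYWORDS and name not in seen:
            seen.append(name)
    return seen


snippet = "int fib (int i) { int pred, result, temp; pred = 1; return(result); }"
print(extract_identifiers(snippet))  # ['fib', 'i', 'pred', 'result', 'temp']
```

A list like this could be written into a multi-valued field of an index document, but matching on names alone breaks down as soon as a copy has been renamed.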

I am not sure whether this can be done using Solr, since different projects may use different naming conventions.

Have a look at the link below; it may help:

Tools for Code Search

Apache Solr is probably not the best option here. This is more of a tree/graph comparison problem than a string comparison problem. I'd recommend using specialized tools for that.

If you do want to do it by hand, you basically need a parser with a tree traversal API, or some other way to get the stream or tree of tokens. This depends very much on the language you are parsing. Something like ANTLR might be one way to go if it has a grammar for your language.
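As a rough illustration of what a token stream buys you, the sketch below uses a hand-rolled regex lexer as a stand-in for a real parser (it is not ANTLR, and `normalized_tokens` is a hypothetical name). It replaces every identifier with `ID` and every number with `NUM`, so two fragments that differ only in naming produce the same stream. The keyword set is an incomplete assumption, and multi-character operators such as `>=` are split into single characters.

```python
import re

# Incomplete keyword set, assumed for this sketch only.
KEYWORDS = {"int", "while", "return", "if", "else", "for"}

# Identifier/keyword, integer literal, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalized_tokens(source):
    """Tokenize C-like code, abstracting away concrete names and numbers."""
    out = []
    for tok in TOKEN.findall(source):
        if tok in KEYWORDS:
            out.append(tok)        # keep keywords as they are
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")       # any identifier
        elif tok[0].isdigit():
            out.append("NUM")      # any integer literal
        else:
            out.append(tok)        # punctuation and operators
    return out


a = normalized_tokens("temp = pred + result;")
b = normalized_tokens("t = a + b;")
print(a)       # ['ID', '=', 'ID', '+', 'ID', ';']
print(a == b)  # True
```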

Alternatively, you could extract the information from the compiled code, if it is structured enough. For Java, something like ASM may do the job.

But you would still have to figure out the representation. Answering, for yourself, the question "how do I know these two pieces of code are similar?" should be the right first step.
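One possible answer to that question, offered only as a sketch and not as the recommended method, is to treat each fragment as a set of overlapping token triples (shingles) and compare the sets with Jaccard similarity. The function names and the choice of `k=3` are assumptions made for illustration; the inputs are assumed to be token lists that have already been normalized.

```python
def shingles(tokens, k=3):
    """Return the set of all runs of k consecutive tokens."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the k-shingle sets of two token lists."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two fragments too short to shingle count as identical
    return len(sa & sb) / len(sa | sb)


x = "ID = ID + ID ; ID = ID ;".split()
y = "ID = ID + ID ; ID = NUM ;".split()
print(jaccard(x, x))  # 1.0
print(jaccard(x, y))  # below 1.0, because the last statement differs
```

A set-based measure like this ignores statement order beyond the shingle window, which is exactly the kind of trade-off the representation question is about.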
