
How to index an entire local hard drive into Apache Solr?

Is there a good approach, either with Solr itself or with a client library feeding into Solr, to index an entire hard drive? This should include the content of zip files, including, recursively, zip files nested within zip files.

This should be able to run on Linux (no Windows-only clients).

This will of course involve a single scan of the entire file system from the root (or from any folder, really). At this point I'm not concerned with keeping the index up to date, just with creating it initially. This would be similar to the old "Google Desktop" app, which Google discontinued.

You can manipulate Solr using the SolrJ API.

Here's the API documentation: http://lucene.apache.org/solr/4_0_0/solr-solrj/index.html

And here's an article on how to use SolrJ, driven by a Java MapReduce job, to index files on your hard drive:
http://blog.cloudera.com/blog/2012/03/indexing-files-via-solr-and-java-mapreduce/

Each file is represented by a SolrInputDocument, and you call .addField on it to attach the fields you'd like to search on later.
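Before any document can be built, the files have to be found. The following is a minimal sketch of the single scan the question asks about, using only the JDK. The class name FileSystemScanner and the field names id, name, size and last_modified are assumptions made for illustration; they do not come from the article or from any Solr schema. Each map stands in for one document, so in SolrJ every entry would become one addField call.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: walks a directory tree once and collects per-file fields.
class FileSystemScanner {

    /** Returns one field map per regular file found under root. */
    static List<Map<String, Object>> scan(Path root) {
        final List<Map<String, Object>> docs = new ArrayList<>();
        try {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    // Symbolic links and device files are not regular files and are skipped.
                    if (attrs.isRegularFile()) {
                        Map<String, Object> doc = new LinkedHashMap<>();
                        doc.put("id", file.toAbsolutePath().toString()); // assumed unique key
                        doc.put("name", file.getFileName().toString());
                        doc.put("size", attrs.size());
                        doc.put("last_modified", attrs.lastModifiedTime().toString());
                        docs.add(doc);
                    }
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFileFailed(Path file, IOException exc) {
                    // Unreadable entries (for example, permission denied) are skipped, not fatal.
                    return FileVisitResult.CONTINUE;
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return docs;
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map<String, Object> doc : scan(root)) {
            System.out.println(doc);
        }
    }
}
```

Skipping unreadable entries instead of aborting is a design choice that suits a scan starting at the root, where some directories are usually not readable by an ordinary user.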

Here's example code for an index driver:

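import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class IndexDriver extends Configured implements Tool {

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(),
     new IndexDriver(), args);
    System.exit(exitCode);
  }

  @Override
  public int run(String[] args) throws Exception {
    // Both paths are required; fail early instead of throwing on args[1]
    if (args.length < 2) {
      System.err.println("Usage: IndexDriver <input path> <output path>");
      return -1;
    }
    JobConf conf = new JobConf(getConf(), IndexDriver.class);
    conf.setJobName("Index Builder - Adam S @ Cloudera");
    // Avoid duplicate task attempts, which would index the same input twice
    conf.setSpeculativeExecution(false);

    // Set Input and Output paths
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    // Use TextInputFormat
    conf.setInputFormat(TextInputFormat.class);

    // Mapper has no output (IndexMapper is not shown here; see the article)
    conf.setMapperClass(IndexMapper.class);
    conf.setMapOutputKeyClass(NullWritable.class);
    conf.setMapOutputValueClass(NullWritable.class);
    // Map-only job: no reducers
    conf.setNumReduceTasks(0);
    JobClient.runJob(conf);
    return 0;
  }
}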

Read the article for more info.

Compressed files

Here's info on handling compressed files: Using Solr CELL's ExtractingRequestHandler to index/extract files from package formats

Solr seems to have a bug that prevents it from handling zip files; here's the bug report, which includes a fix: https://issues.apache.org/jira/browse/SOLR-2416
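If the server-side extraction turns out to be unreliable, nested archives can also be unpacked on the client before anything is sent to Solr. The following is a rough sketch using only java.util.zip. The class name NestedZipWalker, the zipOf helper and the "!/" separator between an archive and its entries are assumptions made for illustration, and an entry is treated as an archive purely because its name ends in .zip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: lists the files inside a zip, descending into nested zips.
class NestedZipWalker {

    /** Returns the path of every file entry; prefix names the archive itself. */
    static List<String> listEntries(InputStream in, String prefix) {
        List<String> found = new ArrayList<>();
        try {
            collect(new ZipInputStream(in), prefix, found);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    private static void collect(ZipInputStream zip, String prefix, List<String> found)
            throws IOException {
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            String path = prefix + "!/" + entry.getName();
            if (entry.getName().toLowerCase(Locale.ROOT).endsWith(".zip")) {
                // The outer stream now yields this entry's bytes, so wrapping it
                // reads the nested archive. The wrapper is deliberately not closed,
                // because closing it would also close the outer stream.
                collect(new ZipInputStream(zip), path, found);
            } else {
                // The entry's content could be read from zip here and indexed.
                found.add(path);
            }
        }
    }

    /** Builds an in-memory zip; only used to demonstrate the walker. */
    static byte[] zipOf(Map<String, byte[]> files) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                out.putNextEntry(new ZipEntry(file.getKey()));
                out.write(file.getValue());
                out.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, byte[]> inner = new LinkedHashMap<>();
        inner.put("notes.txt", "nested".getBytes());
        Map<String, byte[]> outer = new LinkedHashMap<>();
        outer.put("readme.txt", "top".getBytes());
        outer.put("inner.zip", zipOf(inner));
        InputStream in = new ByteArrayInputStream(zipOf(outer));
        for (String path : listEntries(in, "outer.zip")) {
            System.out.println(path);
        }
    }
}
```

Reading the nested archive straight from the outer stream avoids writing temporary files, at the cost of strictly sequential access. A real indexer would probably also want to cap the recursion depth and the total number of bytes unpacked, to guard against archives crafted to expand without limit.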
