
Solr indexing methods and performance

I am trying to understand how adding documents to a Solr index works, and to make sure that I am heading in a good direction.

The data set contains about 40,000 NetCDF files, averaging perhaps 250 KB each. For each file I need to index a subset of its metadata and data.

{
'metadata' :
    {
    'file' : [id, date, ...],
    'identifiers' : [[a, b, c, ...]]
    },
'data' : 
    [[idx, time, lat, lon, a, b, c, ...]]
}

I have written a Python script which calls a data subsetting web service using a few query strings, and generates a JSON object (with the schema above) by filtering through all of the data. This is for a single file. Everything checks out here (although it could be faster).

My plan was to send this JSON object to Solr directly from the script, and this is where I have a few concerns:

-- I just created a ~160KB JSON file. I need to be memory conscious, so I was wondering... do I have to keep this JSON object lying around in some file for Solr to work? What happens if I generate a JSON file, index it, then delete the file?

-- Can I add the document from within the Python script? I saw a few libraries that looked promising. I also recall from the documentation a way to send it to the Solr URL. If I must save the JSON file, can I make a system call to Solr's post command, then delete the file afterwards?

All I need the index to do is provide a URL to the original NetCDF file, and the end user can use the index to gather the relevant info.

Does this sound reasonable? Any performance suggestions?

Irrespective of your indexing method, you are never required to keep the JSON files. You can certainly free up the space, unless you want to keep them around to re-index your data later. Solr stores all the data internally, as defined in the schema. If you have a fixed schema, declare in it which fields you want stored in Solr and which you want only indexed. If you define all the fields as stored fields, then you can always re-index even without the source data. Here is a sample field declaration. Please refer to the link for more details.

<field name="firstname" type="string" indexed="true" stored="true" multiValued="false"/>
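
On the question of adding documents from within the Python script: Solr's update handler accepts a JSON array of documents over HTTP, so the object can be sent straight from memory without ever being written to disk. The following is only a minimal sketch using the standard library. The host, port, and core name ("netcdf") in the URL are assumptions, as are the function names, and the documents must be flat dictionaries whose field names match your schema.

```python
import json
import urllib.request

# Assumed for illustration: a local Solr instance with a core named "netcdf".
SOLR_UPDATE_URL = "http://localhost:8983/solr/netcdf/update?commit=true"


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Build a POST request carrying the documents as a JSON array."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_docs(docs, url=SOLR_UPDATE_URL):
    """Send the documents to Solr and return the HTTP status code."""
    with urllib.request.urlopen(build_update_request(docs, url)) as resp:
        return resp.status
```

With a running Solr, a call such as index_docs([{"id": "file-0001", "url": "http://example.org/data/file-0001.nc"}]) would index one hypothetical document. Committing on every request (commit=true) keeps the sketch simple but is usually slow; for 40,000 files it is likely better to send several documents per request and commit less often.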
