
How to optimize indexing of large number of DB records using Zend_Lucene and Zend_Paginator

Original post: 2010-04-23 13:39:10 | Tags: php, zend-framework, indexing, lucene, zend-search-lucene

So I have this cron script that is deployed and run using Cron on a host and indexes all the records in a database table. The index is later used both for the front end of the site and for back-end operations.

After the operation, the index is about 3-4 MB.

The problem is that it takes a lot of resources (CPU: 30+ and a good chunk of memory) and slows the machine down. My question is about how to optimize the operation described below:

First, a select query is built using the Zend Framework API. This query is then passed to a Paginator factory, which returns a paginator that I use to limit the number of items being indexed at a time, so that I don't iterate over too many items at once. The script iterates over the current items in the paginator object using a foreach loop until it reaches the end, then fetches the items for the next page and starts over.
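For reference, the loop described above might look roughly like the sketch below. It is not the actual cron script: the paging and indexing are hidden behind two hypothetical callables (`$fetchPage` and `$indexItem`) and an in-memory array stands in for the database table, so the example runs without Zend Framework. In the real script, `$fetchPage` would presumably wrap `Zend_Paginator::setCurrentPageNumber()` and `getCurrentItems()`, and `$indexItem` would wrap `Zend_Search_Lucene::addDocument()`.

```php
<?php
/**
 * Walk a large result set page by page and hand every row to an indexer.
 *
 * $fetchPage(int $page, int $pageSize): array  hypothetical, returns the rows of one page
 * $indexItem(array $row): void                 hypothetical, adds one row to the index
 *
 * Returns the number of rows processed.
 */
function indexInPages(callable $fetchPage, callable $indexItem, int $pageSize): int
{
    $processed = 0;
    for ($page = 1; ; $page++) {
        $rows = $fetchPage($page, $pageSize);
        if (count($rows) === 0) {
            break; // past the last page
        }
        foreach ($rows as $row) {
            $indexItem($row);
            $processed++;
        }
        unset($rows); // release the page before fetching the next one
    }
    return $processed;
}

// Stand-in data source: 25 fake records instead of a database table.
$records = [];
for ($i = 1; $i <= 25; $i++) {
    $records[] = ['id' => $i, 'title' => "Record $i"];
}
$fetchPage = function (int $page, int $pageSize) use ($records): array {
    return array_slice($records, ($page - 1) * $pageSize, $pageSize);
};

// Stand-in indexer: collects ids instead of writing to a Lucene index.
$indexedIds = [];
$indexItem = function (array $row) use (&$indexedIds): void {
    $indexedIds[] = $row['id'];
};

$total = indexInPages($fetchPage, $indexItem, 10);
echo "Indexed $total records\n";
```

Only one page of rows is held in memory at a time, which is the point of paging here; the page size of 10 is an arbitrary value chosen for the example.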

I suspect this overhead is caused by Zend_Lucene, but I have no idea how it could be improved.

1 Answer

See my answer to "Can I predict how large my Zend Framework index will be?"

I tested Zend_Search_Lucene versus Apache Lucene (the Java version). In my test, the Java product indexed 1.5 million documents about 300x faster than the PHP product.

You'll be much happier using Apache Solr (a search server built on Apache Lucene that runs in a servlet container such as Tomcat). Solr includes a tool called DataImportHandler that pulls data directly from a JDBC data source.

Use the PECL Solr extension to communicate with Solr from PHP. If you can't install that PHP extension, use cURL, which should be available in default installations of PHP.
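As a rough illustration of the cURL route, here is a minimal sketch that builds a query URL for Solr's select handler and fetches the JSON response. The base URL (`http://localhost:8983/solr/select`), the field name `title`, and the helper names `buildSolrQueryUrl()` and `solrSearch()` are assumptions made for the example; your host, port, core path, and schema will differ.

```php
<?php
// Assumed location of the Solr select handler; adjust for your installation.
const SOLR_SELECT_URL = 'http://localhost:8983/solr/select';

/** Build a select URL asking Solr for a JSON response (hypothetical helper). */
function buildSolrQueryUrl(string $baseUrl, string $query, int $start = 0, int $rows = 10): string
{
    $params = [
        'q'     => $query, // Lucene query string
        'start' => $start, // offset of the first hit to return
        'rows'  => $rows,  // number of hits to return
        'wt'    => 'json', // response writer: JSON
    ];
    return $baseUrl . '?' . http_build_query($params, '', '&');
}

/** Fetch and decode a Solr response; returns null on any failure (hypothetical helper). */
function solrSearch(string $url): ?array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);           // give up after 5 seconds
    $body = curl_exec($ch);
    if ($body === false) {
        return null;
    }
    $decoded = json_decode($body, true);
    return is_array($decoded) ? $decoded : null;
}

// "title" is a made-up field name used only for this example.
$url = buildSolrQueryUrl(SOLR_SELECT_URL, 'title:zend', 0, 20);
echo $url, "\n";

// With a Solr instance running at that address, the search itself would be:
// $result = solrSearch($url);
```

Keeping the URL construction separate from the HTTP call makes it easy to check the query parameters without a running Solr instance.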