
How to throttle dataimports in Solr using batchSize

I have a requirement to import a large amount of data from a MySQL database and index the documents (about 1000 documents). During the indexing process I need to do special processing on a field by sending enhancement requests to an external Apache Stanbol server. I have configured my dataimport handler in solrconfig.xml to use the StanbolContentProcessor in the update chain, as below:

<updateRequestProcessorChain name="stanbolInterceptor">
    <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/dataimport" class="solr.DataImportHandler">   
    <lst name="defaults">  
        <str name="config">data-config.xml</str>
        <str name="update.chain">stanbolInterceptor</str>
    </lst>  
</requestHandler>

My sample data-config.xml is as below:

<dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" 
                url="jdbc:mysql://localhost:3306/solrTest" 
                user="test" password="test123" batchSize="1" />
    <document name="stanboldata">
        <entity name="stanbolrequest" query="SELECT * FROM documents">
            <field column="id" name="id" />
            <field column="content" name="content" />
            <field column="title" name="title" />
        </entity>
    </document>
</dataConfig>

When running a large import with about 1000 documents, my Stanbol server goes down, I suspect due to the heavy load from the Solr Stanbol interceptor above. I would like to throttle the dataimport in batches, so that Stanbol can process a manageable number of requests concurrently.

Is this achievable using the batchSize parameter in the dataSource element of data-config.xml?

Can someone please give some ideas to throttle the dataimport load in Solr?

This is my custom UpdateProcessor class handling Stanbol requests during /dataimport:

public class StanbolContentProcessorFactory extends
        UpdateRequestProcessorFactory {

    public static final String NLP_ORGANIZATION = "nlp_organization";
    public static final String NLP_PERSON = "nlp_person";
    public static final String[] STANBOL_REQUEST_FIELDS = { "title", "content" };
    public static final String STANBOL_ENDPOINT = "http://localhost:8080/enhancer";

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
            SolrQueryResponse res, UpdateRequestProcessor next) {

        return new StanbolContentProcessor(next);
    }

    class StanbolContentProcessor extends UpdateRequestProcessor {

        public StanbolContentProcessor(UpdateRequestProcessor next) {
            super(next);
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            String request = "";
            for (String field : STANBOL_REQUEST_FIELDS) {
                if (null != doc.getFieldValue(field)) {
                    request += (String) doc.getFieldValue(field) + ". ";
                }

            }
            try {
                EnhancementResult result = stanbolPost(request, getBaseURI());
                Collection<TextAnnotation> textAnnotations = result
                        .getTextAnnotations();
                // extracting text annotations
                Set<String> personSet = new HashSet<String>();
                Set<String> orgSet = new HashSet<String>();
                for (TextAnnotation text : textAnnotations) {
                    String type = text.getType();
                    String selectedText = text.getSelectedText();

                    if (null != type && null != selectedText) {
                        if (type.equalsIgnoreCase(StanbolConstants.DBPEDIA_PERSON)
                                || type.equalsIgnoreCase(StanbolConstants.FOAF_PERSON)) {
                            personSet.add(selectedText);

                        } else if (type
                                .equalsIgnoreCase(StanbolConstants.DBPEDIA_ORGANIZATION)
                                || type.equalsIgnoreCase(StanbolConstants.FOAF_ORGANIZATION)) {
                            orgSet.add(selectedText);

                        }
                    }
                }
                for (String person : personSet) {
                    doc.addField(NLP_PERSON, person);
                }
                for (String org : orgSet) {
                    doc.addField(NLP_ORGANIZATION, org);
                }
                cmd.solrDoc = doc;
                super.processAdd(cmd);
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }

    }

    private EnhancementResult stanbolPost(String request, URI uri) {
        Client client = Client.create();
        WebResource webResource = client.resource(uri);
        ClientResponse response = webResource.type(MediaType.TEXT_PLAIN)
                .accept(new MediaType("application", "rdf+xml"))
                .entity(request, MediaType.TEXT_PLAIN)
                .post(ClientResponse.class);

        int status = response.getStatus();
        if (status != 200 && status != 201 && status != 202) {
            throw new RuntimeException("Failed : HTTP error code : "
                    + response.getStatus());
        }
        String output = response.getEntity(String.class);
        // Parse the RDF model

        Model model = ModelFactory.createDefaultModel();
        StringReader reader = new StringReader(output);
        model.read(reader, null);
        return new EnhancementResult(model);

    }


    private static URI getBaseURI() {
        return UriBuilder.fromUri(STANBOL_ENDPOINT).build();
    }

}

The batchSize option is used to retrieve the rows of a database table in batches in order to reduce memory usage (it is often used to prevent running out of memory when running the data import handler). While a lower batch size might be slower, the option is not intended to control the speed of the import process.
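As far as I know, the main knob here is memory: with the MySQL driver, setting batchSize="-1" makes the JdbcDataSource pass a fetch size of Integer.MIN_VALUE to the driver, which streams rows one at a time instead of buffering the whole result set. A sketch of that, reusing the connection details from your data-config.xml:

<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/solrTest"
            user="test" password="test123" batchSize="-1" />

Either way, this only changes how rows are fetched from MySQL, not how fast documents are pushed into your update chain, so it will not protect the Stanbol server.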

My suggestion would be to limit the requests some other way, such as with a firewall rule. If you are using Linux and have access to Netfilter, you could run something like the following command:

iptables -A INPUT -p tcp --dport 12345 -m limit --limit 10/s -j ACCEPT

Where '12345' is the Stanbol port and '10/s' is the number of packets per second to accept.

Mowgli is right, the batchSize will not help you with this. Since most people have the problem the other way around ("my dataimport is too slow, please help"), there is nothing like this in Solr. At least nothing I am aware of.


Personally, I would not opt to configure your Linux system to handle the throttling for you. If you move from stage to stage or migrate to a different server at some point, you have to remember to set it up again. And if the people maintaining the system change during its lifetime, they will not know about it.

So, I do not know the code of your StanbolContentProcessorFactory, but as already mentioned in your other question, it appears to be custom code. Since it is your own code, you can add a throttling mechanism there. To elaborate more on that, I would need some code to look at.


Update

Solr already ships with Google's Guava, so I would use RateLimiter as proposed here. If you are building with Maven, this means you can use the provided scope. If you are not using Maven, there is no need to build a fat jar or to put Guava into Solr's lib folder.
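For reference, a dependency declaration along these lines should work; the version below is only an example, match it to the Guava version bundled with your Solr release:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <!-- example version only; use the one shipped with your Solr distribution -->
    <version>14.0.1</version>
    <scope>provided</scope>
</dependency>

The usage inside the processor factory would then look like this: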

import com.google.common.util.concurrent.RateLimiter;

public class StanbolContentProcessorFactory extends
    UpdateRequestProcessorFactory {

    // ...

    // add a rate limiter to throttle your requests
    // this setting would allow 10 requests per second
    private RateLimiter throttle = RateLimiter.create(10.0);

    // ...

    private EnhancementResult stanbolPost(String request, URI uri) {
        Client client = Client.create();

        // this will throttle your requests
        throttle.acquire();

        WebResource webResource = client.resource(uri);
        ClientResponse response = webResource.type(MediaType.TEXT_PLAIN)
            .accept(new MediaType("application", "rdf+xml"))
            .entity(request, MediaType.TEXT_PLAIN)
            .post(ClientResponse.class);

        int status = response.getStatus();
        if (status != 200 && status != 201 && status != 202) {
            throw new RuntimeException("Failed : HTTP error code : "
                + response.getStatus());
        }
        String output = response.getEntity(String.class);
        // Parse the RDF model
        Model model = ModelFactory.createDefaultModel();
        StringReader reader = new StringReader(output);
        model.read(reader, null);
        return new EnhancementResult(model);
    }
}
