Run DELETE cleanup query after a Solr dataimport

I'm working on a Solr dataimport from an Oracle database. The database system has a set of tables dedicated to storing references to changes in other tables. For example, I might have a table named PERSON, and when records are added to this table, their IDs are added to the PERSON_CHANGED table. I'd like to use this PERSON_CHANGED table when defining my deltaQuery so that Solr only indexes the changed records in subsequent imports. As part of this process, I need to remove records that I've read from the PERSON_CHANGED table after Solr finishes its import (either delta or full), so that I don't process them again later.

What's the best way to run this kind of "cleanup" SQL query after a dataimport?

I've tried combining both of the queries like this (simplified for brevity):

<dataConfig>
    <dataSource ... >
    <document>
        <entity name="person"
                query="
                    SELECT ID, FIRST_NAME, LAST_NAME
                    FROM PERSON
                    WHERE '${dataimporter.request.clean}' != 'false'
                        OR PERSON_ID IN (
                            SELECT ID FROM CHANGED_PERSON
                        );

                    DELETE * (
                        SELECT * FROM CHANGED_PERSON
                    );
        " />
    </document>
</dataConfig>

But this results in a "SQL command not properly ended" error. Does Solr provide a way to do this kind of cleanup?

Once you use delta-import in Solr, it won't process the same records twice, because the time of the last import is recorded on every run.

Ref doc:

When delta-import command is executed, it reads the start time stored in conf/dataimport.properties.

link: https://wiki.apache.org/solr/DataImportHandler#Delta-Import_Example
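A minimal sketch of what such a timestamp-based delta configuration might look like for the PERSON table from the question. It assumes a LAST_MODIFIED column on PERSON, which is hypothetical and does not appear in the question; the dataSource definition is omitted, as in the question.

```xml
<dataConfig>
    <!-- dataSource definition omitted -->
    <document>
        <!-- query: used by full-import.
             deltaQuery: returns only the primary keys of rows changed since the last import.
             deltaImportQuery: fetches the full row for each key returned by deltaQuery.
             LAST_MODIFIED is an assumed column, not part of the original schema. -->
        <entity name="person"
                pk="ID"
                query="SELECT ID, FIRST_NAME, LAST_NAME FROM PERSON"
                deltaQuery="SELECT ID FROM PERSON
                            WHERE LAST_MODIFIED &gt; TO_DATE('${dataimporter.last_index_time}', 'YYYY-MM-DD HH24:MI:SS')"
                deltaImportQuery="SELECT ID, FIRST_NAME, LAST_NAME FROM PERSON
                                  WHERE ID = '${dataimporter.delta.ID}'" />
    </document>
</dataConfig>
```

With a configuration along these lines, command=delta-import runs deltaQuery and deltaImportQuery, while command=full-import runs only query.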

From your question, I imagine that you're performing a full import every time you run the delta import (a full import cleans the Solr index, etc.). This is not the proper way to do a delta import.

What I would recommend is:

1) Perform a delta import (not a full import) for regular updates.
2) Once every X days or months, if you need to, perform a clean full import. It is better to do this in another core, so that your service keeps running and you only have to swap the cores.

I found a way to accomplish this cleanup task, but I'm not super happy with it. I can define a separate entity whose query runs a DELETE:

<dataConfig>
    <dataSource ... >
    <document>
        <entity name="person"
                query="
                    SELECT ID, FIRST_NAME, LAST_NAME
                    FROM PERSON
                    WHERE '${dataimporter.request.clean}' != 'false'
                        OR PERSON_ID IN (
                            SELECT ID FROM CHANGED_PERSON
                        )" />

        <entity name="deleteChangedPersonRecords"
                query="DELETE FROM CHANGED_PERSON" />
    </document>
</dataConfig>

This seems to work, but it's a bit of a hack, and it relies on the assumption that Solr executes its entity queries in the same order that they are specified in the file. If anyone has a better solution, please feel free to add your answer to this question.
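One possible alternative that does not depend on entity ordering, offered as an untested sketch: DataImportHandler allows the document element to name listener classes through its onImportStart and onImportEnd attributes. The named class must implement org.apache.solr.handler.dataimport.EventListener. The class name com.example.ChangedPersonCleanupListener below is hypothetical; it would have to be written and deployed separately, with its onEvent(Context) method issuing the DELETE FROM CHANGED_PERSON statement.

```xml
<dataConfig>
    <!-- dataSource definition omitted -->
    <!-- onImportEnd names a class that is invoked once the import has finished.
         com.example.ChangedPersonCleanupListener is a hypothetical class name. -->
    <document onImportEnd="com.example.ChangedPersonCleanupListener">
        <entity name="person"
                query="
                    SELECT ID, FIRST_NAME, LAST_NAME
                    FROM PERSON
                    WHERE '${dataimporter.request.clean}' != 'false'
                        OR PERSON_ID IN (
                            SELECT ID FROM CHANGED_PERSON
                        )" />
    </document>
</dataConfig>
```

As with the separate-entity approach, any rows added to CHANGED_PERSON between the SELECT and the DELETE would be removed without having been indexed.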
