
Why does Solr delete documents after import?

I import data from MySQL using the DataImportHandler. This works very well and I get this message:

Indexing completed. Added/Updated: 2,172 documents. Deleted 0 documents. (Duration: 01s) Requests: 1 (1/s), Fetched: 2,172 (2,172/s), Skipped: 0, Processed: 2,172 (2,172/s)

But when I look at my Overview it says:

Num Docs: 1470 Max Doc: 2172 Deleted Docs: 702

So 702 documents got deleted for a reason I cannot figure out. My schema doesn't use a unique field or anything else that could cause trouble with duplicates.


data-config.xml

<dataConfig>
  <dataSource type="JdbcDataSource"
    driver="com.mysql.jdbc.Driver"
    url="xxx"
    user="xxx"
    password="xxx"
  />
  <document>
   <entity name="product" query="CALL getSolrProducts();" transformer="RegexTransformer">
      <field column="uuid" name="uuid"/>
      <field column="id" name="id"/>
      <field column="productimage" name="productimage"/>
      <field column="producturl" name="producturl"/>
      <field column="productpricenew" name="productpricenew"/>
      <field column="productpriceold" name="productpriceold"/>
      <field column="brandid" name="productbrand"/>
      <field column="productbrandname" name="productbrandname"/>
      <field column="productbrandurl" name="productbrandurl"/>
      <field column="productbrandimage" name="productbrandimage"/>
      <field column="productbranddata" name="productbranddata"/>
      <field column="productshippingcoast" name="productshippingcoast"/>
      <field column="productlink" name="productlink"/>
      <field column="color" name="color" splitBy=","/>
      <field column="colordata" name="colordata" splitBy=","/>
      <field column="productdescription" name="productdescription"/>
      <field column="upc" name="upc" splitBy=","/>
      <field column="productname" name="productname"/>
      <field column="productshop" name="productshop"/>
      <field column="productshopname" name="productshopname"/>
      <field column="productshopimage" name="productshopimage"/>
      <field column="productimagethumb" name="productimagethumb"/>
      <field column="productshopdata" name="productshopdata"/>
      <field column="cat1id" name="cat1id"/>
      <field column="cat2id" name="cat2id"/>
      <field column="cat3id" name="cat3id"/>
      <field column="cat4id" name="cat4id"/>
      <field column="cat1data" name="cat1data"/>
      <field column="cat2data" name="cat2data"/>
      <field column="cat3data" name="cat3data"/>
      <field column="cat4data" name="cat4data"/>
      <field column="size" name="size" splitBy=","/>
      <field column="sizedata" name="sizedata" splitBy=","/>
      <field column="recommendations" name="recommendations" splitBy=","/>
    </entity>
  </document>
</dataConfig>

Does anyone have a pointer?

Since you checked clean, DIH first issues a "delete all" update query and then starts posting the new documents. Once indexing finishes, DIH issues a commit, which keeps only the newly posted documents and deletes all the documents that existed before indexing started. Your database must have been updated in the meantime, so you have more documents now, and the 702 deleted docs correspond to the documents that were in your index before the import began. (Checking optimize in DIH will purge the deleted documents, but an optimize can be expensive on large indexes, and deleted docs do not show up in search results anyway, so it may not be of much benefit.)
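To make the numbers and the options above concrete, here is a minimal Python sketch. It only does the arithmetic on the Overview counters (Deleted Docs is Max Doc minus Num Docs) and builds the request URL for a full import with clean and optimize set. The host `http://localhost:8983/solr`, the core name `products`, and the `/dataimport` handler path are assumptions for illustration; use whatever your solrconfig.xml actually registers. The helper names are hypothetical and not part of Solr.

```python
from urllib.parse import urlencode


def deleted_docs(max_doc: int, num_docs: int) -> int:
    """Documents marked as deleted but not yet purged: Max Doc minus Num Docs."""
    return max_doc - num_docs


def dih_full_import_url(base_url: str, core: str,
                        clean: bool = True, optimize: bool = False) -> str:
    """Build a DIH full-import URL (hypothetical helper, handler path assumed)."""
    params = {
        "command": "full-import",
        "clean": str(clean).lower(),        # delete existing docs before importing
        "commit": "true",                   # commit once the import finishes
        "optimize": str(optimize).lower(),  # merge segments, purging deleted docs
    }
    return f"{base_url}/{core}/dataimport?{urlencode(params)}"


# Counters from the question's Overview page.
print(deleted_docs(max_doc=2172, num_docs=1470))

# Assumed host and core name, for illustration only.
print(dih_full_import_url("http://localhost:8983/solr", "products", optimize=True))
```

Requesting the printed URL would start the import; the sketch deliberately stops at building it so that nothing is sent to a live server by accident.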

