Solr not returning all documents after importing with the Data Handler

I have a Solr 8.7.0 installation and I'm using the Data Import Handler plugin over a MySQL (JDBC) connection.

I have four entities declared:

<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/hmsscot_bassculture"
              user="myuser"
              password="mypw"/>
  <document>
    <entity name="author" query="select id,type,firstname,surname,biographical_info,extrainfo from bassculture_author">
      <field column="id" name="id"/>
      <field column="type" name="type"/>
      <field column="firstname" name="firstname"/>
      <field column="surname" name="surname"/>
      <field column="biographical_info" name="biographical_info"/>
      <field column="extrainfo" name="extrainfo"/>
    </entity>

    <entity name="source" query="select id,type,short_title,full_title,publisher,author_id,orientation,variants from bassculture_source">
      <field column="id" name="id"/>
      <field column="type" name="type"/>
      <field column="short_title" name="short_title"/>
      <field column="full_title" name="full_title"/>
      <field column="publisher" name="publisher"/>
      <field column="author_id" name="author_id"/>
      <entity name="author" query="SELECT s.*, CONCAT(ba.firstname, ' ', ba.surname) AS author FROM bassculture_source s, bassculture_author ba WHERE s.id=${source.id} AND s.author_id = ba.id;">
        <field column="author" name="author"/>
      </entity>
      <field column="description" name="description"/>
      <field column="orientation" name="orientation"/>
      <field column="variants" name="variants"/>
    </entity>

    <entity name="copy" query="select id,type,folder,source_id,item_notes,seller,library,shelfmark,pagination,dimensions from bassculture_item">
      <field column="id" name="id"/>
      <field column="type" name="type"/>
      <field column="folder" name="folder"/>
      <field column="source_id" name="source_id"/>
      <entity name="source_title" query="select id,short_title from bassculture_source where id=${copy.source_id}">
        <field column="short_title" name="source_title"/>
      </entity>
      <entity name="source_author" query="SELECT bt.*, CONCAT(ba.firstname, ' ', ba.surname) AS source_author FROM bassculture_tune bt, bassculture_item c, bassculture_source s, bassculture_author ba WHERE c.id=${copy.id} AND c.source_id = s.id AND s.author_id = ba.id;">
        <field column="source_author" name="source_author"/>
      </entity>
      <field column="item_notes" name="item_notes"/>
      <field column="seller" name="seller"/>
      <field column="library" name="library"/>
      <field column="shelfmark" name="shelfmark"/>
      <field column="paginations" name="pagination"/>
      <field column="dimensions" name="dimension"/>
    </entity>

    <entity name="tune" query="select id,type,name,start_page,alternate_spellings,item_id from bassculture_tune">
      <field column="id" name="id"/>
      <field column="type" name="type"/>
      <field column="name" name="name"/>
      <entity name="source_title" query="select s.* FROM bassculture_source s, bassculture_item c, bassculture_tune bt where bt.id=${tune.id} AND c.source_id = s.id AND bt.item_id = c.id">
        <field column="short_title" name="source_title"/>
      </entity>
      <entity name="tune_author" query="SELECT bt.*, CONCAT(ba.firstname, ' ', ba.surname, ' ', ba.extrainfo) AS tune_author FROM bassculture_tune bt, bassculture_item c, bassculture_source s, bassculture_author ba WHERE bt.id=${tune.id} AND bt.item_id = c.id AND c.source_id = s.id AND s.author_id = ba.id;">
        <field column="tune_author" name="tune_author" />
      </entity>
      <field column="start_page" name="start_page"/>
      <field column="alternate_spellings" name="alternate_spellings"/>
      <field column="item_id" name="item_id"/>
    </entity>

  </document>
</dataConfig>

Now, I'm experiencing something which doesn't make sense to me. If I run the data importer leaving the 'entity' drop-down blank (i.e. import all entities):

[screenshot]

I get:

Indexing completed. Added/Updated: 2357 documents. Deleted 0 documents. (Duration: 13s)

This is the correct number of documents (authors+sources+copies+tunes). Nevertheless, when I query the index I only get 1938 documents:

  "responseHeader":{
    "status":0,
    "QTime":103,
    "params":{
      "q":"*:*",
      "_":"1609335106436"}},
  "response":{"numFound":1938,"start":0,"numFoundExact":true,"docs":[
      {
    [...]

These are only the tunes (the last entity in the configuration file above). I also see this in the dashboard:

[screenshot]

If, on the other hand, I select the entities one by one (e.g. author etc.):

[screenshot]

the plugin correctly imports the author, source, and copy entities (each time the *:* query reflects the documents imported). Once I get to the fourth entity though (tune), the index apparently 'forgets' about the previous three entities, although after running it the plugin reports 'documents deleted: 0', and the *:* query goes back to only 1938 documents found (i.e. only tunes).

There's no error message in the logs. What am I missing?

PARTIAL SOLUTION

I managed to add a prefix to the id in order to differentiate the four entity types, so that documents with the same unique ID don't get overwritten, e.g.:

SELECT name,start_page,alternate_spellings,item_id, CONCAT('tune_', id) AS id, 'tune' as type FROM bassculture_tune;

Nevertheless, I still need the database id (without the prefix) of the current tune for some later comparison, e.g.:

  <entity name="tune_author" query="SELECT bt.*, CONCAT(ba.firstname, ' ', ba.surname, ' ', ba.extrainfo) AS tune_author FROM bassculture_tune bt, bassculture_item c, bassculture_source s, bassculture_author ba WHERE bt.id=${tune.id} AND bt.item_id = c.id AND c.source_id = s.id AND s.author_id = ba.id;">
    <field column="tune_author" name="tune_author" />
  </entity>

Since ${tune.id} now has a prefix, the whole query doesn't do what I need any more. Is there a way to strip the prefix locally?

Edit 2

The query

<entity name="tune_author" query="select s.* FROM bassculture_source s, bassculture_item c, bassculture_tune bt WHERE bt.id=REPLACE(${tune.id}, 'tune_', '') AND c.source_id = s.id AND bt.item_id = c.id;">

throws an error (unable to execute query) when importing data into Solr.

This is the error in the Solr log:

Caused by: java.sql.SQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'SELECT REPLACE(tune_1, 'tune_', ''), AND c.source_id = s.id AND bt.item_id = c.i' at line 1

PS

Something like

select item_id FROM bassculture_tune bt WHERE bt.id= (SELECT REPLACE('tune_1', 'tune_', ''));

works just fine on the MySQL console.

Introducing variables

I'm trying my luck with a variable now:

<entity name="this_tune_id" query="SET @this_tune_id = REPLACE('${tune.id}','tune_','');">
        </entity>
      <entity name="source_title" query="select s.* FROM bassculture_source s, bassculture_item c, bassculture_tune bt WHERE c.source_id = s.id AND bt.item_id = c.id AND bt.id = ${this_tune_id};">
        <field column="short_title" name="source_title"/>
      </entity>

This gives me a

org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 1

error.

FINAL SOLUTION

I am selecting the database ID as this_tune_id and the Solr id (with the prefix) as id, so that I can use this_tune_id in my nested queries while still storing a prefixed id in Solr:

<entity name="tune" query="SELECT name,start_page,alternate_spellings,item_id, id AS this_tune_id, CONCAT('tune_', id) AS id, 'tune' as type FROM bassculture_tune;">

  <field column="name" name="name"/>

  <entity name="source_title" query="select s.* FROM bassculture_source s, bassculture_item c, bassculture_tune bt WHERE c.source_id = s.id AND bt.item_id = c.id AND bt.id = ${tune.this_tune_id};">

The screenshot containing data from your import reveals the reason: maxDocs shows that 2357 documents have been imported, but 419 of them have been marked as deleted. Your unique key field (usually id) overlaps between the documents you're importing, so the newer documents overwrite the older ones.

419 documents have been overwritten by documents imported later because of overlapping ids. That accounts for the difference: 2357 - 419 = 1938, which is the numFound you see.
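One way to confirm the overlap on the database side is to count the tune ids that also appear in the other tables. This is only a sketch: it assumes the four table names from the question's config and that id is the uniqueKey in the Solr schema. A non-zero result confirms colliding ids, but it need not equal 419 exactly, since the same id can occur in more than one of the other tables.

```sql
-- Count ids in bassculture_tune that also occur in at least one of the
-- other three imported tables. Each such id collides in Solr when "id"
-- is the uniqueKey, so the later import replaces the earlier document.
SELECT COUNT(*) AS colliding_ids
FROM bassculture_tune t
WHERE t.id IN (
    SELECT id FROM bassculture_author
    UNION
    SELECT id FROM bassculture_source
    UNION
    SELECT id FROM bassculture_item
);
```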

You can solve this by prepending the entity type to your ids (there is no need for the ids to be numeric). The easiest way is to add the prefix in your SQL:

SELECT CONCAT('tune_', id) AS id, .. FROM ..
SELECT CONCAT('author_', id) AS id, .. FROM ..
... repeating for each source ..

That way the id for an author will be author_1 and will not overwrite tune_1 as it would otherwise, where both would have 1 as their id.
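Spelled out for the four top-level entities in the question, the queries might look roughly like this. The column lists are copied from the question's config. The prefixes source_ and copy_ and the db_id alias are illustrative names, not something the import handler requires. The raw id is kept under a second alias so that nested entity queries can still join on the numeric key, for example through ${tune.db_id}.

```sql
-- Each top-level entity query emits a prefixed id for Solr and keeps the
-- raw primary key as db_id (hypothetical alias) for nested entity lookups.
SELECT CONCAT('author_', id) AS id, id AS db_id,
       type, firstname, surname, biographical_info, extrainfo
FROM bassculture_author;

SELECT CONCAT('source_', id) AS id, id AS db_id,
       type, short_title, full_title, publisher, author_id, orientation, variants
FROM bassculture_source;

SELECT CONCAT('copy_', id) AS id, id AS db_id,
       type, folder, source_id, item_notes, seller, library, shelfmark, pagination, dimensions
FROM bassculture_item;

SELECT CONCAT('tune_', id) AS id, id AS db_id,
       type, name, start_page, alternate_spellings, item_id
FROM bassculture_tune;
```

Nested entity queries that previously used ${source.id}, ${copy.id} or ${tune.id} would then reference the db_id column instead, so the prefix never reaches the WHERE clause.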
