Solr: DIH for multilingual index & multiValued field?

I have a MySQL table:

CREATE TABLE documents (
    id INT NOT NULL AUTO_INCREMENT,
    language_code CHAR(2),
    tags CHAR(30),
    text TEXT,
    PRIMARY KEY (id)
);

I have two questions about Solr DIH:

1) The language_code field indicates what language the text field is in. Depending on the language, I want to index text into a different Solr field.

# pseudo code

if language_code == "en":
    index "text" to Solr field "text_en"
elif language_code == "fr":
    index "text" to Solr field "text_fr"
elif language_code == "zh":
    index "text" to Solr field "text_zh"
...

Can DIH handle a use case like this? How do I configure it to do so?

2) The tags field needs to be indexed into a Solr multiValued field. Multiple values are stored in one string, separated by commas. For example, if tags contains the string "blue, green, yellow", then I want to index the 3 values "blue", "green", "yellow" into a Solr multiValued field.

How do I do that with DIH?

Thanks.

First, your schema needs to allow it, with something like this:

<dynamicField name="text_*" type="string" indexed="true" stored="true" />

Then, in your DIH config, something like this:

<entity name="document" dataSource="ds1" transformer="script:ftextLang" query="SELECT * FROM documents" />

With the script defined just below the dataSource element:

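<script><![CDATA[
  // Copy "text" into a per-language field such as text_en or text_fr,
  // which the text_* dynamicField in the schema picks up.
  function ftextLang(row) {
    var name = row.get('language_code');
    var value = row.get('text');
    row.put('text_' + name, value);
    return row;
  }
]]></script>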
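For the second question (the comma-separated tags), one option is DIH's RegexTransformer, whose splitBy attribute splits a column into multiple values. The following is only a sketch, assuming the Solr field is also called tags and is declared multiValued in the schema. The entity and dataSource names are taken from the config above; the ",\s*" pattern is my own choice so that the space after each comma is dropped.

```xml
<!-- schema.xml: the target field must accept several values -->
<field name="tags" type="string" indexed="true" stored="true" multiValued="true" />

<!-- DIH config: chain RegexTransformer with the script transformer -->
<entity name="document" dataSource="ds1"
        transformer="RegexTransformer,script:ftextLang"
        query="SELECT * FROM documents">
  <!-- splitBy is a regular expression; "blue, green, yellow" becomes three values -->
  <field column="tags" splitBy=",\s*" />
</entity>
```

Transformers listed in the transformer attribute are applied in order, so the tags split and the per-language copy can live on the same entity.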

I'm sorry I don't have a direct answer to your DIH question, though it'd be interesting to know.

I did notice your 2-letter language code and would suggest a 5-character slot instead. Some languages have dialect differences that are non-trivial, for example Simplified Chinese vs. Traditional Chinese. For morphological analysis, the SmartCN filter can handle zh-cn, but not zh-tw, etc.

Portuguese and Spanish are also languages where we've been warned against mixing all dialects together, although the differences are less drastic, and both would still be searchable.
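If you do move to 5-character codes, note that a value like zh-CN would end up in a field named text_zh-CN under the script-transformer approach, and as far as I know Solr recommends sticking to alphanumerics and underscores in field names. Below is a rough sketch of a variant that normalizes the code first. langSuffix and StubRow are hypothetical names of mine; StubRow only imitates the Map-like row that DIH passes in, so the function can be tried outside Solr.

```javascript
// Hypothetical helper: turn a locale-style code such as "zh-CN" into a
// suffix that is safe to use in a Solr field name ("zh_cn").
function langSuffix(code) {
  return String(code).trim().toLowerCase().replace(/[^a-z0-9]+/g, '_');
}

// Variant of the ftextLang transformer that uses the normalized suffix
// and skips rows that have no language code or no text.
function ftextLang(row) {
  var code = row.get('language_code');
  var value = row.get('text');
  if (code != null && value != null) {
    row.put('text_' + langSuffix(code), value);
  }
  return row;
}

// Minimal stand-in for the row object, for trying this outside Solr only.
function StubRow(values) {
  this.values = values;
}
StubRow.prototype.get = function (key) {
  return Object.prototype.hasOwnProperty.call(this.values, key) ? this.values[key] : null;
};
StubRow.prototype.put = function (key, value) {
  this.values[key] = value;
};

var row = ftextLang(new StubRow({ language_code: 'zh-CN', text: 'hello' }));
console.log(row.get('text_zh_cn'));
```

The text_* dynamicField still matches the normalized names, so the schema side does not need to change for this.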

Of course, you may already know this and simply left it out of the question to keep it simple. It's just a subject that is very fresh in my mind.
