
Inputting arbitrary XML in Solr

I have a question on Apache Solr. If I have an arbitrary XML file, and an XSD that it conforms to, how do I input it into Solr? Could I get a code sample? I know you have to parse the XML and put the relevant data in a Solr input document, but I don't understand how to do that.

The DataImportHandler (DIH) allows you to pass the incoming XML to an XSL, as well as parse and transform the XML with DIH transformers. You could translate your arbitrary XML to Solr's standard input XML format via XSL, map/transform the arbitrary XML to the Solr schema fields right there in the DIH config file, or use a combination of both. DIH is flexible.
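For reference, Solr's standard input (update) XML format mentioned above looks like the following. This is a minimal illustrative example; the field names are hypothetical, not taken from the sample config below:

```xml
<add>
    <doc>
        <field name="id">doc-1</field>
        <field name="title_t">An example title</field>
        <!-- Repeating a field name populates a multiValued field -->
        <field name="subject_t">First subject</field>
        <field name="subject_t">Second subject</field>
    </doc>
</add>
```

If your XSL can emit this format from your arbitrary XML, Solr can ingest it directly.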

Sample dih-config.xml

Here's a sample dih-config.xml from an actual working site (no pseudo-samples here, my friend). Note that it picks up XML files from a local directory on the LAMP server. If you prefer to post XML files directly via HTTP, you would need to configure a ContentStreamDataSource instead.
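For the HTTP-posting case, a minimal sketch of a ContentStreamDataSource-based config might look like the following. This is an assumption-laden outline, not a tested configuration; entity and dataSource names are illustrative:

```xml
<dataConfig>
    <!-- Reads the XML from the body of the HTTP request instead of from disk -->
    <dataSource type="ContentStreamDataSource" name="stream" />
    <document>
        <entity
            name="xml"
            processor="XPathEntityProcessor"
            dataSource="stream"
            useSolrAddSchema="true" />
    </document>
</dataConfig>
```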

It so happens that the incoming XML is already in standard Solr update XML format in this sample, and all the XSL does is remove empty field nodes; the real transforms, such as building the content of "ispartof_t" from "ignored_seriestitle", "ignored_seriesvolume", and "ignored_seriesissue", are done with DIH Regex transformers. (The XSLT is performed first, and its output is then given to the DIH transformers.) The attribute "useSolrAddSchema" tells DIH that the XML is already in standard Solr XML format. If that were not the case, another attribute, "xpath", on the XPathEntityProcessor would be required to select content from the incoming XML document.

<dataConfig>
    <dataSource encoding="UTF-8" type="FileDataSource" />
    <document>
        <!--
            Pickupdir fetches all files matching the filename regex in the supplied directory
            and passes them to other entities which parse the file contents. 
        -->
        <entity
            name="pickupdir"
            processor="FileListEntityProcessor"
            rootEntity="false"
            dataSource="null"
            fileName="^[\w\d-]+\.xml$"
            baseDir="/var/lib/tomcat6/solr/cci/import/"
            recursive="true"
            newerThan="${dataimporter.last_index_time}"
        >

        <!--
            Pickupxmlfile parses standard Solr update XML.
            Incoming values are split into multiple tokens when given a splitBy attribute.
            Dates are transformed into valid Solr dates when given a dateTimeFormat to parse.
        -->
        <entity 
            name="xml"
            processor="XPathEntityProcessor"
            transformer="RegexTransformer,TemplateTransformer"
            dataSource="pickupdir"
            stream="true"
            useSolrAddSchema="true"
            url="${pickupdir.fileAbsolutePath}"
            xsl="xslt/dih.xsl"
        >

            <field column="abstract_t" splitBy="\|" />
            <field column="coverage_t" splitBy="\|" />
            <field column="creator_t" splitBy="\|" />
            <field column="creator_facet" template="${xml.creator_t}" />
            <field column="description_t" splitBy="\|" />
            <field column="format_t" splitBy="\|" />
            <field column="identifier_t" splitBy="\|" />
            <field column="ispartof_t" sourceColName="ignored_seriestitle" regex="(.+)" replaceWith="$1" />
            <field column="ispartof_t" sourceColName="ignored_seriesvolume" regex="(.+)" replaceWith="${xml.ispartof_t}; vol. $1" />
            <field column="ispartof_t" sourceColName="ignored_seriesissue" regex="(.+)" replaceWith="${xml.ispartof_t}; no. $1" />
            <field column="ispartof_t" regex="\|" replaceWith=" " />
            <field column="language_t" splitBy="\|" />
            <field column="language_facet" template="${xml.language_t}" />
            <field column="location_display" sourceColName="ignored_class" regex="(.+)" replaceWith="$1" />
            <field column="location_display" sourceColName="ignored_location" regex="(.+)" replaceWith="${xml.location_display} $1" />
            <field column="location_display" regex="\|" replaceWith=" " />
            <field column="othertitles_display" splitBy="\|" />
            <field column="publisher_t" splitBy="\|" />
            <field column="responsibility_display" splitBy="\|" />
            <field column="source_t" splitBy="\|" />
            <field column="sourceissue_display" sourceColName="ignored_volume" regex="(.+)" replaceWith="vol. $1" />
            <field column="sourceissue_display" sourceColName="ignored_issue" regex="(.+)" replaceWith="${xml.sourceissue_display}, no. $1" />
            <field column="sourceissue_display" sourceColName="ignored_year" regex="(.+)" replaceWith="${xml.sourceissue_display} ($1)" />
            <field column="src_facet" template="${xml.src}" />
            <field column="subject_t" splitBy="\|" />
            <field column="subject_facet" template="${xml.subject_t}" />
            <field column="title_t" sourceColName="ignored_title" regex="(.+)" replaceWith="$1" />
            <field column="title_t" sourceColName="ignored_subtitle" regex="(.+)" replaceWith="${xml.title_t} : $1" />
            <field column="title_sort" template="${xml.title_t}" />
            <field column="toc_t" splitBy="\|" />
            <field column="type_t" splitBy="\|" />
            <field column="type_facet" template="${xml.type_t}" />
        </entity>
        </entity>
    </document>
</dataConfig>
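The dih.xsl referenced above is not included in the answer. As a sketch only, an XSLT that copies the document through unchanged while dropping empty field nodes, as described, might look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <!-- Identity transform: copy every node and attribute through -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Drop <field> elements with no (non-whitespace) text content -->
    <xsl:template match="field[not(normalize-space())]"/>
</xsl:stylesheet>
```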

To set up DIH:

  • Ensure the DIH jars are referenced from solrconfig.xml, as they are not included in the Solr WAR file by default. One easy way is to create a lib folder in the Solr instance directory that includes the DIH jars, since solrconfig.xml looks in the lib folder for references by default. Find the DIH jars in the apache-solr-xxx/dist folder when you download the Solr package.

dist folder: (screenshot showing the location of the Solr DIH jars)

  • Create your dih-config.xml (as above) in the Solr "conf" directory.

  • Add a DIH request handler to solrconfig.xml if it's not there already.

request handler:

<requestHandler name="/update/dih" startup="lazy" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">dih-config.xml</str>
    </lst>
</requestHandler>

To trigger DIH:

There is a lot more info on full-import vs. delta-import, and whether to commit, optimize, etc., in the wiki description of Data Import Handler Commands, but the following would trigger the DIH operation without deleting the existing index first, and commit the changes after all the files had been processed. The sample given above would collect all the files found in the pickup directory, transform them, index them, and finally commit the updates to the index (which would make them searchable the instant the commit finished).

http://localhost:8983/solr/update/dih?command=full-import&clean=false&commit=true

The easiest way is probably to use the DataImportHandler, which allows you to apply an XSL first to transform your XML to Solr input XML.

After some research and finding nothing fully automated to do what you are asking... I think I found something.

Lux SOLR might be what we are looking for: http://luxdb.org/SETUP.html

It seems to take SOLR and make it Lux-enabled, which indexes arbitrary XML.
