
What are the ways to store and search complex numeric data?

I have some numerical data that must be searchable from a web front-end. It has the following format:

Toy type: Dog
Toy subtype: Spotted
Toy maker: John
Color: White
Estimated spots: 10
Actual spots: 11

Toy type: Cat
Toy subtype: Striped
Toy maker: Jane
Color: White
Estimated stripes: 5
Actual stripes: [Not yet counted]

A search query might be something like "Type:Cat, Stripes:4-6", or "Type:Dog, Subtype:Spotted", or "Color:White", or "Color:White, Maker:John".

I'm not sure if the data is best suited for a relational database because there are several types and subtypes, each with their own properties. On top of that, there are estimated and actual values for each property.

I'd like some recommendations for how to store and search this data. Please help!

EDIT: I changed the search queries so they are no longer free-form.

You have structured the problem in such a way as to make this very difficult to solve. Your data is structured data, with specific columns. Yet, you are trying to use free-form queries to search through it.

So, the normal way to do this is to allow search terms for each of the fields.
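A minimal sketch of per-field searching, assuming the records are simply held in memory as Python dictionaries. The `TOYS` list and the `search` helper are illustrative names, not part of any library, and a real system would push the same terms down to a database or search index:

```python
# Sample records taken from the question; None marks a value not yet counted.
TOYS = [
    {"type": "Dog", "subtype": "Spotted", "maker": "John", "color": "White",
     "estimated_spots": 10, "actual_spots": 11},
    {"type": "Cat", "subtype": "Striped", "maker": "Jane", "color": "White",
     "estimated_stripes": 5, "actual_stripes": None},
]


def matches(record, exact=None, ranges=None):
    """Return True if the record satisfies every exact-match and range term."""
    for field, wanted in (exact or {}).items():
        if record.get(field) != wanted:
            return False
    for field, (low, high) in (ranges or {}).items():
        value = record.get(field)
        # A missing or uncounted value can never satisfy a range term.
        if value is None or not (low <= value <= high):
            return False
    return True


def search(records, exact=None, ranges=None):
    """Return the records that match all supplied per-field terms."""
    return [r for r in records if matches(r, exact, ranges)]


# "Type:Cat, Stripes:4-6" expressed as per-field terms.
cats = search(TOYS, exact={"type": "Cat"}, ranges={"estimated_stripes": (4, 6)})
# "Color:White"
white = search(TOYS, exact={"color": "White"})

print([t["maker"] for t in cats])   # ['Jane']
print([t["maker"] for t in white])  # ['John', 'Jane']
```

Each search term names the field it applies to, so a number can only ever be compared against the column it was meant for.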

The next way to approach this is as a full-text problem. This definitely has its issues. For instance, numbers are typically treated as stop words. And values in different fields would get confused with each other.
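To illustrate the field-confusion issue, here is a rough sketch of a naive full-text search over the sample data. It assumes each record is flattened into a bag of lowercase tokens with the field names thrown away; `tokens` and `full_text_search` are made-up helpers for this demonstration, not how any particular full-text engine works:

```python
import re

# Sample records from the question, keyed by an arbitrary id.
TOYS = {
    1: {"type": "Dog", "subtype": "Spotted", "maker": "John", "color": "White",
        "estimated_spots": 10, "actual_spots": 11},
    2: {"type": "Cat", "subtype": "Striped", "maker": "Jane", "color": "White",
        "estimated_stripes": 5},
}


def tokens(record):
    """Flatten a record's values into a set of lowercase tokens (field names are lost)."""
    text = " ".join(str(value) for value in record.values())
    return set(re.findall(r"\w+", text.lower()))


def full_text_search(query):
    """Return ids of records containing every query token, in any field."""
    wanted = set(re.findall(r"\w+", query.lower()))
    return sorted(doc_id for doc_id, rec in TOYS.items() if wanted <= tokens(rec))


# Intended meaning: "dogs with an estimated 11 spots".
hits = full_text_search("dog 11")
print(hits)  # [1], even though toy 1's estimate is 10; the 11 matched actual_spots
```

The search cannot tell an estimate from an actual count, which is exactly the confusion between fields described above.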

Of course, you can try to do free-form search on structured data. This is, after all, something that Google and Microsoft are doing. If you search "airfare from New York to London" on Google, you will get lists of flights. But this is a hard problem to approach through understanding the query.

I recommend using Apache Solr to index and search your data.

How you use Solr depends on your requirements. I use it as a searchable cache of my data. This is extremely useful when the raw master data must be kept as files. Lots of frameworks integrate Solr as their search backend.

For building front-ends to a Solr index, check out solr-ajax.

Example

Install Solr

Download the Solr distribution:

wget http://www.apache.org/dist/lucene/solr/4.7.0/solr-4.7.0.tgz
tar zxvf solr-4.7.0.tgz

Start Solr using the embedded Jetty container:

cd solr-4.7.0/example
java -jar start.jar

Solr should now be running locally:

http://localhost:8983/solr

data.xml

You did not specify a data format, so I used the native XML supported by Solr:

<add>
  <doc>
    <field name="id">1</field>
    <field name="toy_type_s">Dog</field>
    <field name="toy_subtype_s">Spotted</field>
    <field name="toy_maker_s">John</field>
    <field name="color_s">White</field>
    <field name="estimated_spots_i">10</field>
    <field name="actual_spots_i">11</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="toy_type_s">Cat</field>
    <field name="toy_subtype_s">Striped</field>
    <field name="toy_maker_s">Jane</field>
    <field name="color_s">White</field>
    <field name="estimated_spots_i">5</field>
  </doc>
</add>

Notes:

  • Every document in Solr must have a unique id.
  • The field names have a trailing "_s" or "_i" to indicate the field type. This is a cheat to take advantage of Solr's dynamic field feature.
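If the data starts out as application records rather than hand-written XML, a file like data.xml could be generated along these lines. This is only a sketch using Python's standard library: `solr_field_name` and `to_add_xml` are hypothetical helpers, and the suffix rule (ints get "_i", everything else gets "_s") is an assumption that matches the two field types used above.

```python
import xml.etree.ElementTree as ET


def solr_field_name(key, value):
    """Append a dynamic-field suffix: _i for ints, _s for everything else."""
    if key == "id":
        return key  # the unique id field keeps its plain name
    return f"{key}_i" if isinstance(value, int) else f"{key}_s"


def to_add_xml(records):
    """Build a Solr <add> document from a list of plain dictionaries."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for key, value in record.items():
            if value is None:  # skip values that have not been counted yet
                continue
            field = ET.SubElement(doc, "field", name=solr_field_name(key, value))
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


records = [
    {"id": "1", "toy_type": "Dog", "toy_subtype": "Spotted", "toy_maker": "John",
     "color": "White", "estimated_spots": 10, "actual_spots": 11},
    {"id": "2", "toy_type": "Cat", "toy_subtype": "Striped", "toy_maker": "Jane",
     "color": "White", "estimated_spots": 5, "actual_spots": None},
]

xml_text = to_add_xml(records)
print(xml_text)
```

Leaving out the uncounted value, rather than writing an empty field, mirrors how the second document in data.xml simply has no actual-count field.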

Index XML file

There are lots of ways to get data into Solr. The simplest way is the curl command:

curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" --data-binary @data.xml

It's worth noting that Solr supports other data formats, such as JSON and CSV.
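As a rough illustration of the JSON route, the same two documents could be prepared for posting from Python. This sketch assumes the update handler at the URL used above accepts a JSON array of documents when the request is sent with a `Content-Type` of `application/json`; check that against your Solr version. `build_update_request` is a hypothetical helper, and the request is only constructed here, not sent.

```python
import json
import urllib.request

# Same endpoint as the curl command above (assumed to accept JSON as well).
SOLR_UPDATE_URL = "http://localhost:8983/solr/update?commit=true"

docs = [
    {"id": "1", "toy_type_s": "Dog", "toy_subtype_s": "Spotted",
     "toy_maker_s": "John", "color_s": "White",
     "estimated_spots_i": 10, "actual_spots_i": 11},
    {"id": "2", "toy_type_s": "Cat", "toy_subtype_s": "Striped",
     "toy_maker_s": "Jane", "color_s": "White",
     "estimated_spots_i": 5},
]


def build_update_request(documents, url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST request carrying the documents as JSON."""
    body = json.dumps(documents).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_update_request(docs)
print(request.get_method(), request.full_url)

# Sending needs a running Solr instance; uncomment to actually index:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```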

Search indexed file

Again, there are language libraries that support Solr searches, but the following examples use curl. The Solr search syntax is along the lines of what you've asked for.

Here's a simple example:

$ curl "http://localhost:8983/solr/select/?q=toy_type_s:Cat"
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="q">toy_type_s:Cat</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">2</str>
      <str name="toy_type_s">Cat</str>
      <str name="toy_subtype_s">Striped</str>
      <str name="toy_maker_s">Jane</str>
      <str name="color_s">White</str>
      <int name="estimated_spots_i">5</int>
      <long name="_version_">1463999035283079168</long>
    </doc>
  </result>
</response>
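A web front-end would need to turn a response like this into plain records. Here is a small sketch using only Python's standard library; `parse_docs` is a hypothetical helper and `SAMPLE_RESPONSE` is a trimmed copy of the response shown above, not a live result:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML response shown above.
SAMPLE_RESPONSE = """<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">2</str>
      <str name="toy_type_s">Cat</str>
      <str name="toy_subtype_s">Striped</str>
      <str name="toy_maker_s">Jane</str>
      <str name="color_s">White</str>
      <int name="estimated_spots_i">5</int>
    </doc>
  </result>
</response>"""


def parse_docs(xml_text):
    """Return (numFound, list of dicts) from a Solr XML search response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = []
    for doc in result.findall("doc"):
        record = {}
        for field in doc:
            value = field.text
            if field.tag in ("int", "long"):  # numeric elements become Python ints
                value = int(value)
            record[field.get("name")] = value
        docs.append(record)
    return int(result.get("numFound")), docs


num_found, toys = parse_docs(SAMPLE_RESPONSE)
print(num_found, toys[0]["toy_maker_s"], toys[0]["estimated_spots_i"])  # 1 Jane 5
```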

A more complex search example:

$ curl "http://localhost:8983/solr/select/?q=toy_type_s:Cat%20AND%20estimated_spots_i:\[2%20TO%206\]" 
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
    <lst name="params">
      <str name="q">toy_type_s:Cat AND estimated_spots_i:[2 TO 6]</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">2</str>
      <str name="toy_type_s">Cat</str>
      <str name="toy_subtype_s">Striped</str>
      <str name="toy_maker_s">Jane</str>
      <str name="color_s">White</str>
      <int name="estimated_spots_i">5</int>
      <long name="_version_">1463999035283079168</long>
    </doc>
  </result>
</response>
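To connect this back to the query format in the question, here is a sketch of translating something like "Type:Cat, Stripes:4-6" into the Solr syntax used above. `FIELD_MAP`, `to_solr_query` and `to_select_url` are hypothetical names, and the mapping is an assumption based on the field names in data.xml (which stores the cat's estimate under `estimated_spots_i`). Values are not escaped, so this only suits simple single-word terms.

```python
from urllib.parse import urlencode

# Hypothetical mapping from the question's search keys to the fields in data.xml.
FIELD_MAP = {
    "Type": "toy_type_s",
    "Subtype": "toy_subtype_s",
    "Maker": "toy_maker_s",
    "Color": "color_s",
    "Spots": "estimated_spots_i",
    "Stripes": "estimated_spots_i",  # data.xml keeps the cat's estimate in this field
}


def to_solr_query(user_query):
    """Translate 'Key:Value, Key:low-high' into a Solr query string."""
    clauses = []
    for term in user_query.split(","):
        key, _, value = term.strip().partition(":")
        field = FIELD_MAP[key.strip()]
        value = value.strip()
        low, dash, high = value.partition("-")
        if dash and low.isdigit() and high.isdigit():
            clauses.append(f"{field}:[{low} TO {high}]")  # inclusive range query
        else:
            clauses.append(f"{field}:{value}")
    return " AND ".join(clauses)


def to_select_url(user_query, base="http://localhost:8983/solr/select/"):
    """Build a URL-encoded select URL for the translated query."""
    return base + "?" + urlencode({"q": to_solr_query(user_query)})


print(to_solr_query("Type:Cat, Stripes:4-6"))
# toy_type_s:Cat AND estimated_spots_i:[4 TO 6]
print(to_select_url("Color:White, Maker:John"))
```

Letting `urlencode` handle the escaping avoids hand-writing `%20` and backslash-escaped brackets as in the curl example.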
