
Category Matching - regex vs full text search

Original post: 2012-04-18 15:34:24 | Tags: c#, sql, sql-server-2008

I have a fairly large category table with 1,500 categories (some are single words, others contain several words), and I'm looking for the best way to match new products to these categories by their title.

I've been looking at using regex and looping through the product description for keywords, but this wouldn't be very efficient when adding over one thousand products at a time. I've also been looking at full-text search (FREETEXT and CONTAINS), but FREETEXT seems to bring back a lot of results because it matches any and all words in a product description.

Has anyone done something similar, automating the choice of a product's category from its description, and can you offer some advice or pointers?

1 Answer

So the question, as I understand it, is: given a description, which category does it belong to?

A common method for this kind of work is to build a naive Bayes classifier and run all of your descriptions through it.

Classification like this usually takes place in two stages.

Stage 1: known description/category pairs are used to "train" the classifier.

Stage 2: once the classifier is trained, you can give it unknown data, and it returns the probability that the description matches a given category.
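To make the two stages concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. It is written in Python for brevity, though the same structure should carry over to C#. The class name, method names, tokenizer, and sample product titles are all made up for illustration, so treat it as a starting point under those assumptions rather than a finished implementation.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesClassifier:
    """Hypothetical helper class: multinomial naive Bayes over word counts."""

    def __init__(self):
        self.category_counts = Counter()         # training texts seen per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocabulary = set()                  # every word seen in training

    def train(self, text, category):
        """Stage 1: record one known (description, category) pair."""
        self.category_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocabulary.add(word)

    def probabilities(self, text):
        """Stage 2: return {category: probability} for an unseen description."""
        if not self.category_counts:
            raise ValueError("classifier has not been trained")
        words = tokenize(text)
        total_texts = sum(self.category_counts.values())
        # The extra slot stands in for words never seen during training.
        vocab_size = len(self.vocabulary) + 1
        log_scores = {}
        for category, text_count in self.category_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Work in log space so long descriptions do not underflow.
            score = math.log(text_count / total_texts)
            for word in words:
                # Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            log_scores[category] = score
        # Turn log scores into probabilities that sum to 1.
        peak = max(log_scores.values())
        scaled = {c: math.exp(s - peak) for c, s in log_scores.items()}
        norm = sum(scaled.values())
        return {c: v / norm for c, v in scaled.items()}

    def classify(self, text):
        """Return the single most probable category."""
        probs = self.probabilities(text)
        return max(probs, key=probs.get)


# Made-up training data, purely for illustration.
classifier = NaiveBayesClassifier()
classifier.train("Canon EOS digital SLR camera body", "Cameras")
classifier.train("Nikon compact digital camera with zoom lens", "Cameras")
classifier.train("Stainless steel kitchen knife set", "Kitchen Knives")
classifier.train("Chef knife with wooden handle", "Kitchen Knives")

probs = classifier.probabilities("Sony digital camera")
best = classifier.classify("Sony digital camera")
```

In practice the training pairs would come from products that already have a category assigned, and a new product could be left for manual review whenever its top probability falls below a threshold you choose.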

The classifier in this approach is usually pretty accurate, but since we are dealing with statistics, errors do creep in.