
Information retrieval (IR) vs data mining vs machine learning (ML)

Original post: 2010-08-05 18:04:16. Tags: machine-learning, data-mining, information-retrieval

People often throw around the terms IR, ML, and data mining, but I have noticed a lot of overlap between them.

For people with experience in these fields: what exactly draws the line between them?

4 Answers

This is just the view of one person (formally trained in ML); others might see things quite differently.

Machine Learning is probably the most homogeneous of these three terms and the most consistently applied: it is limited to the pattern-extraction (or pattern-matching) algorithms themselves.

Of the terms you mentioned, "Machine Learning" is the one most used by academic departments to describe their curricula and research programs, and the one most used in academic journals and conference proceedings. ML is clearly the least context-dependent of the three.

Information Retrieval and Data Mining come much closer to describing complete commercial processes, i.e., from user query to retrieval and delivery of relevant results. ML algorithms might sit somewhere in that process flow, and in the more sophisticated applications they often do, but that is not a formal requirement. In addition, the term Data Mining usually seems to refer to applying some process flow to big data (i.e., > 2 GB), and therefore usually includes a distributed processing (map-reduce) component near the front of that workflow.

So Information Retrieval (IR) and Data Mining (DM) relate to Machine Learning (ML) in an infrastructure-to-algorithm kind of way. In other words, Machine Learning is one source of tools used to solve problems in Information Retrieval, but only one source, and IR doesn't depend on ML. For instance, a particular IR project might be the storage and rapid retrieval of fully indexed data in response to a user's search query; the crux is optimizing the performance of the data flow, i.e., the round trip from query to delivery of the search results to the user. Prediction or pattern matching might not be useful here. Likewise, a DM project might use an ML algorithm for its predictive engine, yet it is more likely to also be concerned with the entire processing flow: for instance, parallel computation techniques for efficient input of an enormous data volume (terabytes, perhaps), which deliver a proto-result to a processing engine that computes descriptive statistics (mean, standard deviation, distribution, etc.) on the variables (columns).
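To make the DM processing flow described above concrete, here is a minimal sketch of computing descriptive statistics in a map-reduce style. It is not taken from any particular framework: the function names (`map_chunk`, `combine`, `finalize`) and the sample data are hypothetical, and the "workers" are simulated with plain lists.

```python
import math
from functools import reduce

def map_chunk(values):
    """Map step: summarize one chunk as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def combine(a, b):
    """Reduce step: merge two partial summaries element-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def finalize(summary):
    """Turn the merged summary into mean and population standard deviation."""
    n, total, total_sq = summary
    mean = total / n
    # Clamp at zero to guard against tiny negative values from rounding.
    variance = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)

# Hypothetical column of one variable, split across three workers.
chunks = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]

partials = [map_chunk(chunk) for chunk in chunks]   # the "proto-results"
mean, std = finalize(reduce(combine, partials))
print(mean, std)
```

No learning happens anywhere in this flow; the work is in splitting, summarizing, and merging. The sum-of-squares formula is chosen here for brevity and can lose precision on large or badly scaled data, so a real pipeline would more likely merge running means and variances instead.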

Lastly, consider the Netflix Prize. This competition was directed solely at Machine Learning: the focus was on the prediction algorithm, as evidenced by the single success criterion, the accuracy of the predictions returned by the algorithm. Imagine if the Netflix Prize were rebranded as a Data Mining competition. The success criteria would almost certainly be expanded to assess more accurately the algorithm's performance in the actual commercial setting, so overall execution speed (how quickly recommendations are delivered to the user), for instance, would probably be considered along with accuracy.
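As a rough illustration of the difference between the two sets of success criteria: the Netflix Prize scored entries by root mean squared error (RMSE) on held-out ratings. The sketch below computes RMSE and, alongside it, the wall-clock time of the predictor, which is the kind of extra criterion a "Data Mining" version of the contest might add. The `evaluate` helper, the constant predictor, and the ratings are all made up for illustration.

```python
import math
import time

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def evaluate(predict, inputs, actual):
    """Return (rmse, seconds): accuracy plus time spent producing predictions."""
    start = time.perf_counter()
    predicted = [predict(x) for x in inputs]
    elapsed = time.perf_counter() - start
    return rmse(predicted, actual), elapsed

def constant_predictor(_pair):
    """Hypothetical baseline: always guess a rating of 3.5."""
    return 3.5

# Hypothetical (user, movie) pairs and their true ratings.
inputs = [("user1", "movieA"), ("user1", "movieB"), ("user2", "movieA")]
actual = [4.0, 3.0, 5.0]

score, seconds = evaluate(constant_predictor, inputs, actual)
print(f"RMSE={score:.4f}, time={seconds:.6f}s")
```

An ML-only contest would rank entries on `score` alone; a contest judged on the whole pipeline would also have to weigh `seconds`, and probably other operational measures.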

The terms "Information Retrieval" and "Data Mining" are now in mainstream use, though for a while I only saw them in my job description or in vendor literature (usually next to the word "solution"). At my employer, we recently hired a "Data Mining" analyst. I don't know exactly what he does, but he wears a tie to work every day.

I'd try to draw the line as follows:

Information retrieval is about finding something that is already part of your data, as fast as possible.

Machine learning is a set of techniques for generalizing existing knowledge to new data, as accurately as possible.

Data mining is primarily about discovering something hidden in your data that you did not know before, as "new" as possible.

They intersect and often use each other's techniques. DM and IR both use index structures to accelerate processing. DM uses a lot of ML techniques; for example, a pattern in the data set that is useful for generalization might itself be new knowledge.

They are often hard to separate. Do yourself a favor and don't just go for the buzzwords. In my opinion, the best way to distinguish them is by their intention, as given above: find data, generalize to new data, find new properties of existing data.
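The three intentions can be shown side by side on one toy dataset. This is only a sketch under made-up assumptions: the baskets, the labels, and the helper names (`search`, `classify`, `jaccard`) are hypothetical, and each technique is about the simplest one that fits its intention (an inverted index, a nearest-neighbor classifier, and pair counting).

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy data: shopping baskets and a hand-assigned label for each.
baskets = {
    1: {"bread", "butter", "milk"},
    2: {"bread", "butter"},
    3: {"beer", "chips"},
    4: {"bread", "butter", "jam"},
}
labels = {1: "breakfast", 2: "breakfast", 3: "party", 4: "breakfast"}

# IR: find something that is already in the data, fast (inverted index).
index = defaultdict(set)
for basket_id, items in baskets.items():
    for item in items:
        index[item].add(basket_id)

def search(item):
    """Return the ids of all stored baskets containing the item."""
    return sorted(index.get(item, set()))

# ML: generalize existing knowledge to new data (1-nearest neighbor).
def jaccard(a, b):
    """Set similarity: size of intersection over size of union."""
    return len(a & b) / len(a | b)

def classify(new_items):
    """Label an unseen basket with the label of its most similar known basket."""
    nearest = max(baskets, key=lambda bid: jaccard(baskets[bid], new_items))
    return labels[nearest]

# DM: discover something not known beforehand (frequently co-occurring pairs).
pair_counts = Counter()
for items in baskets.values():
    pair_counts.update(combinations(sorted(items), 2))

print(search("butter"))                      # baskets that contain butter
print(classify({"chips", "beer", "soda"}))   # label for an unseen basket
print(pair_counts.most_common(1))            # most frequent item pair
```

The overlap mentioned above shows up even here: the frequent pair found by the DM step could be fed back into the classifier as a feature, and the index built for IR would speed up both of the other steps on larger data.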

You can also add pattern recognition and (computational?) statistics as another couple of areas that overlap with the three you mentioned.

I'd say there is no well-defined line between them. What separates them is their history and their emphases. Statistics emphasizes mathematical rigor, data mining emphasizes scaling to large datasets, and ML is somewhere in between.

Data mining is about discovering hidden patterns or unknown knowledge that people can use for decision making.

Machine learning is about learning a model to classify new objects.