
Relational Data Mining without ILP

Original post: 2014-06-17 09:33:14 | Tags: algorithm, relational-database, classification, data-mining

I have a huge dataset from a relational database for which I need to create a classification model. Normally I would use ILP in this situation, but due to special circumstances I can't do that.

The other way to tackle this would be to aggregate the values wherever I have a foreign-key relation. However, some nominal attributes have thousands of important, distinct values (e.g. a patient related to several distinct drug prescriptions), and I can't aggregate those without creating a new attribute for each distinct value of that nominal attribute. Furthermore, most of the new columns would hold NULL values if I did that.

Is there any non-ILP algorithm that allows me to mine relational databases without resorting to techniques like pivoting, which would create thousands of new columns?

1 Answer

First, some caveats

I'm not sure why you can't use your preferred programming (sub-)paradigm*, Inductive Logic Programming (ILP), or what it is that you're trying to classify. Giving more detail would probably lead to a much better answer, especially as it's a little unusual to select classification algorithms on the basis of the programming paradigm with which they're associated. If your real-world example is confidential, then simply make up a fictional-but-analogous example.

Big Data Classification without ILP

Having said that, after ruling out ILP we have four other logic programming paradigms in our consideration set:

Abductive
Answer Set
Constraint
Functional

in addition to the dozens of paradigms and sub-paradigms outside of logic programming.

Within Functional Logic Programming, for instance, there exists an extension of ILP called Inductive Functional Logic Programming, which is based on inversion narrowing (i.e. inversion of the narrowing mechanism). This approach overcomes several limitations of ILP and (according to some scholars, at least) is as suitable for application in terms of representation, with the added benefit of allowing problems to be expressed in a more natural way.

Without knowing more about the specifics of your database and the barriers you face to using ILP, I can't know whether this solves your problem or suffers from the same problems. As such, I'll throw out a completely different approach as well.

ILP is contrasted with "classical" or "propositional" approaches to data mining. Those approaches include the meat and potatoes of Machine Learning, like decision trees, neural networks, regression, bagging and other statistical methods. Rather than give up on these approaches due to the size of your data, you can join the ranks of the many Data Scientists, Big Data engineers and statisticians who utilize High Performance Computing (HPC) to employ these methods on massive data sets. There are also sampling and other statistical techniques you may choose to utilize to reduce the computational resources and time required to analyze the Big Data in your relational database.
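As one illustration of the sampling idea, here is a minimal sketch of reservoir sampling (Algorithm R), which draws a fixed-size uniform sample from rows streamed out of a database without loading the whole table into memory. This is my own choice of technique, not something the question prescribes; the function name `reservoir_sample` and the idea of feeding it a row iterator from your database cursor are assumptions for illustration only.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k rows chosen uniformly at random from a stream.

    The stream is read once and only k rows are held in memory, so the
    total number of rows does not need to be known in advance.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1) by picking a
            # slot in [0, i] and replacing only if it falls inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


if __name__ == "__main__":
    # Stand-in for a cursor over a large table (hypothetical data).
    patient_ids = range(1_000_000)
    print(len(reservoir_sample(patient_ids, k=1000, seed=42)))
```

A propositional learner could then be trained on the sample first, and scaled up to the full data only if the sample proves too small.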

HPC includes things like utilizing multiple CPU cores, scaling up your analysis with elastic use of servers with high memory and large numbers of fast CPU cores, using high-performance data warehouse appliances, employing clusters or other forms of parallel computing, etc. I'm not sure what language or statistical suite you're analyzing your data with, but as an example, this CRAN Task View lists many HPC resources for the R language which would allow you to scale up a propositional algorithm.
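To make the "multiple CPU cores" point concrete, here is a rough sketch in Python rather than R. It splits prescription rows into chunks, counts drugs per chunk in separate worker processes, and merges the partial counts. The `(patient_id, drug_name)` row layout and all function names are hypothetical; only `multiprocessing.Pool` and `collections.Counter` come from the standard library.

```python
from collections import Counter
from multiprocessing import Pool
from typing import Callable, List, Sequence, Tuple

Row = Tuple[int, str]  # (patient_id, drug_name): a hypothetical schema


def count_drugs(chunk: Sequence[Row]) -> Counter:
    """Count prescriptions per drug within one chunk of rows."""
    return Counter(drug for _patient_id, drug in chunk)


def split(rows: Sequence, n_chunks: int) -> List[Sequence]:
    """Split rows into at most n_chunks contiguous slices of near-equal size."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def total_counts(rows: Sequence[Row], n_chunks: int, map_fn: Callable = map) -> Counter:
    """Aggregate per-chunk counts; pass pool.map as map_fn to run in parallel."""
    total: Counter = Counter()
    for partial in map_fn(count_drugs, split(rows, n_chunks)):
        total.update(partial)
    return total


if __name__ == "__main__":
    rows = [(1, "aspirin"), (1, "statin"), (2, "aspirin"), (3, "insulin")] * 1000
    with Pool(processes=2) as pool:
        # Each chunk is counted in a worker process, then merged here.
        print(total_counts(rows, n_chunks=4, map_fn=pool.map).most_common(3))
```

The same split-apply-merge shape carries over to clusters and to the R packages in the task view; only the mapping function changes.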