
How to store a decision tree while I'm building it in C?

Conceptual question here.

I am building a decision tree recursively. Each call of the function takes a subset of the training examples, iterates through all features and all possible splits within each feature, finds the best split, divides the subset into two smaller subsets, and calls itself twice, once for each smaller subset.

I have coded this previously in MATLAB, but it ran too slowly, so now I'm trying it in C (which I'm less familiar with). In MATLAB, I used a global 'splits' matrix that held every split's information (which feature, what value within that feature, the classification if the node is a leaf, and the row numbers of each of the children). That way I could walk through the matrix with a new test datapoint to find its classification.

It looks like a global 2D array in C is possible with a header file, but I'd rather not get into header files if there's another way to do it. The problem is that, because the function is called recursively, it's tough for me to know what the next available row in 'splits' is. I could put the children of row i at rows 2*i and 2*i+1, but for a deep tree with a lot of splits this would take a huge amount of initial storage.

Any ideas?

Sounds to me like you have to let go of the 2D array to represent your tree. A tree of arbitrary degree in C typically looks like:

struct node
{       struct node ** children;
        int num_children;
        /* Values in the node/leafs */
};
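As a rough illustration of how such a node might be grown at run time, here is a minimal sketch. The helper names node_new, node_add_child and node_free, and the int value payload, are made up for this example; they are not part of any library.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
    struct node **children;  /* dynamically sized array of child pointers */
    int num_children;
    int value;               /* placeholder payload (assumption) */
};

/* Allocate a childless node; returns NULL if malloc fails. */
struct node *node_new(int value)
{
    struct node *n = malloc(sizeof *n);
    if (n != NULL) {
        n->children = NULL;
        n->num_children = 0;
        n->value = value;
    }
    return n;
}

/* Append child to parent; returns 0 on success, -1 if realloc fails. */
int node_add_child(struct node *parent, struct node *child)
{
    assert(parent != NULL && child != NULL);
    struct node **grown = realloc(parent->children,
                                  (parent->num_children + 1) * sizeof *grown);
    if (grown == NULL)
        return -1;           /* parent is left unchanged on failure */
    grown[parent->num_children++] = child;
    parent->children = grown;
    return 0;
}

/* Free a node and everything below it. */
void node_free(struct node *n)
{
    if (n == NULL)
        return;
    for (int i = 0; i < n->num_children; i++)
        node_free(n->children[i]);
    free(n->children);
    free(n);
}
```

Growing the child array by one slot per call keeps the sketch short; a real implementation would probably grow it in larger steps.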

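If the degree of the tree is fixed, or every node has a known maximum number of children, then a fixed-size array of child pointers will do (MAX_CHILDREN is a placeholder name; 2 would suit a binary decision tree):

#define MAX_CHILDREN 2
struct node
{       struct node * children[MAX_CHILDREN];
        int num_children; /* If degree has only a maximum */
        /* Values in the node/leafs */
};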

You will have to use malloc and friends to allocate memory for the nodes and their children.
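For the binary decision tree in the question, that could look roughly like the sketch below. All names here (struct dnode, tree_leaf, tree_split, tree_classify, tree_free) are hypothetical, and the rule "go left when x[feature] < threshold" is an assumption about how the splits are defined. The point is that a recursive builder can simply return a pointer to the subtree it built, something like tree_split(f, t, build(left_subset), build(right_subset)), so no global bookkeeping of "next free row" is needed.

```c
#include <assert.h>
#include <stdlib.h>

struct dnode {
    int feature;          /* index of the feature tested at this node */
    double threshold;     /* assumed rule: go left when x[feature] < threshold */
    int label;            /* classification, meaningful only for leaves */
    struct dnode *left;   /* both child pointers are NULL for a leaf */
    struct dnode *right;
};

/* Allocate a leaf carrying a class label; returns NULL if malloc fails. */
struct dnode *tree_leaf(int label)
{
    struct dnode *n = malloc(sizeof *n);
    if (n != NULL) {
        n->feature = -1;
        n->threshold = 0.0;
        n->label = label;
        n->left = NULL;
        n->right = NULL;
    }
    return n;
}

/* Allocate an internal node that owns two already-built subtrees. */
struct dnode *tree_split(int feature, double threshold,
                         struct dnode *left, struct dnode *right)
{
    assert(left != NULL && right != NULL);
    struct dnode *n = malloc(sizeof *n);
    if (n != NULL) {
        n->feature = feature;
        n->threshold = threshold;
        n->label = -1;    /* unused for internal nodes */
        n->left = left;
        n->right = right;
    }
    return n;
}

/* Walk from the root to a leaf with a plain loop; no recursion needed. */
int tree_classify(const struct dnode *n, const double *x)
{
    while (n->left != NULL)
        n = (x[n->feature] < n->threshold) ? n->left : n->right;
    return n->label;
}

/* Free a subtree. */
void tree_free(struct dnode *n)
{
    if (n == NULL)
        return;
    tree_free(n->left);
    tree_free(n->right);
    free(n);
}
```

Error handling is deliberately thin here: tree_split asserts that both subtrees exist rather than cleaning up after a failed allocation.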

About the header file: header files are a blessing in C, not a curse, but if you insist on doing without one, you can always paste a header's contents in place of its #include line.
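If you would rather keep something close to the 'splits' table from the question, note that a file-scope array in a single .c file needs no header at all, and the "next available row" problem can be handled with a counter. The following is only a sketch under those assumptions; struct split, split_new and split_classify are invented names, and the "go left when x[feature] < value" rule is assumed.

```c
#include <assert.h>
#include <stdlib.h>

struct split {
    int feature;      /* feature tested at this row */
    double value;     /* assumed rule: go left when x[feature] < value */
    int label;        /* classification if this row is a leaf */
    int left, right;  /* row indices of the children, -1 for a leaf */
};

/* File-scope table: visible to every function in this .c file, no header. */
static struct split *splits = NULL;
static int splits_used = 0;   /* next available row */
static int splits_cap = 0;    /* rows currently allocated */

/* Reserve the next free row, growing the table on demand.
 * Returns the row index, or -1 if realloc fails. */
int split_new(void)
{
    if (splits_used == splits_cap) {
        int cap = splits_cap ? 2 * splits_cap : 16;
        struct split *grown = realloc(splits, cap * sizeof *grown);
        if (grown == NULL)
            return -1;
        splits = grown;
        splits_cap = cap;
    }
    struct split *s = &splits[splits_used];
    s->feature = -1;
    s->value = 0.0;
    s->label = -1;
    s->left = -1;
    s->right = -1;
    return splits_used++;
}

/* Follow child indices from the given row down to a leaf. */
int split_classify(int row, const double *x)
{
    assert(row >= 0 && row < splits_used);
    while (splits[row].left != -1)
        row = (x[splits[row].feature] < splits[row].value)
                  ? splits[row].left
                  : splits[row].right;
    return splits[row].label;
}
```

Because realloc may move the table, a recursive builder should hold on to row indices rather than pointers into splits, and should store the result of a recursive call in a local variable before writing it into splits[row].left or splits[row].right.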

If you are moving from MATLAB to another language to speed up your implementation, then you may want to consider languages other than C first. Languages like Java, Python or even Haskell may give you similar speedups, with less hassle over pointers.

Using this sort of functional design isn't pretty in C, because there is no guarantee that recursive calls will be optimised into loops, and there are no anonymous functions. C++ at least has lambdas, so I would suggest that C++ is more suitable for this, though AFAIK C++ gives no guarantee of that optimisation either.

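In order to avoid recursion, and the stack growth that would probably come with it, each branch needs to return the next branch it selects. The caller (main) then loops on the return value, and terminates the loop when the return value is our terminal value. C cannot declare a function type that returns a pointer to its own type, so we define the branch type as a struct wrapping a function pointer:

#include <assert.h>
#include <stdlib.h>
#include <time.h>

typedef struct branch {
    struct branch (*fn)(void);  /* the branch to run next */
} branch;

We then declare each actual branch to return a branch type. The prototypes come first, because the branches refer to one another:

branch initial(void);
branch terminal(void);
branch left(void);
branch right(void);
branch right_left(void);
branch right_right(void);

branch initial(void) {
    /* do initial processing; the random choice stands in for a real decision */
    int x = rand() % 2;
    return (branch){ x == 0 ? left : right };
}

branch terminal(void) {
    /* This should never be called */
    assert(0);
    return (branch){ NULL };
}

branch left(void) {
    /* do left processing */
    return (branch){ terminal }; /* return a terminal branch to indicate no
                                  * further processing */
}

branch right(void) {
    /* do right processing, storing either a 0 in x to indicate right_left
     * as the selected branch, or 1 in x to indicate right_right; the random
     * choice below is only a placeholder for that decision
     */
    int x = rand() % 2;
    return (branch){ x == 0 ? right_left : right_right };
}

branch right_left(void) {
    /* do right_left processing */
    return (branch){ initial }; /* return initial to repeat that branch */
}

branch right_right(void) {
    /* do right_right processing */
    return (branch){ right }; /* return right to repeat that branch */
}

... and looping on the return value would look like this:

int main(void) {
    srand((unsigned)time(NULL)); /* seed once, not on every pass through initial */
    branch b = { initial };
    while (b.fn != terminal) {
        b = b.fn();
    }
    return 0;
}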
