
Why does my association model find subgroups in a dataset when there shouldn't be any?

I give a lot of information on the methods that I used to write my code. If you just want to read my question, skip to the quotes at the end.

I'm working on a project that has a goal of detecting sub populations in a group of patients. I thought this sounded like the perfect opportunity to use association rule mining as I'm currently taking a class on the subject.

There are 42 variables in total. Of those, 20 are continuous and had to be discretized. For each of those variables, I used the Freedman-Diaconis rule to determine how many bins to divide its range into.

def Freedman_Diaconis(column_values):

    # column_values is a (name, values) pair; sort a copy so the
    # caller's data is left untouched
    values = sorted(column_values[1])
    n = len(values)

    # interquartile range from the sorted values
    fq_value = values[int(n * .25)]
    tq_value = values[int(n * .75)]
    iqr = tq_value - fq_value

    # Freedman-Diaconis bin width: h = 2 * IQR * n^(-1/3).
    # The -1.0/3 matters: under Python 2, -1/3 is integer division
    # and silently evaluates to -1
    h = 2 * iqr * n ** (-1.0 / 3)

    # number of bins = data range / bin width, rounded up.
    # Note the range must use values[0] (the minimum), not values[1]
    data_range = values[-1] - values[0]
    return int(data_range / h) + 1
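As a cross-check on the function above (a sketch, not part of the original script), the same rule fits in a few lines of NumPy; `np.histogram_bin_edges(values, bins='fd')` applies essentially the same formula internally:

```python
import numpy as np

def fd_bin_count(values):
    # Freedman-Diaconis: bin width h = 2 * IQR * n^(-1/3)
    values = np.asarray(values, dtype=float)
    iqr = np.percentile(values, 75) - np.percentile(values, 25)
    h = 2 * iqr * len(values) ** (-1 / 3)
    # number of bins = data range / bin width, rounded up
    return int(np.ceil((values.max() - values.min()) / h))

print(fd_bin_count(range(100)))  # -> 5
```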

From there I used min-max normalization

from sklearn import preprocessing

def min_max_transform(column_of_data, num_bins):
    # rescale the column linearly onto [1, num_bins]
    # (newer scikit-learn versions require 2-D input,
    #  e.g. np.array(column_of_data[1]).reshape(-1, 1))
    min_max_normalizer = preprocessing.MinMaxScaler(feature_range=(1, num_bins))
    data_min_max = min_max_normalizer.fit_transform(column_of_data[1])
    data_min_max_ints = take_int(data_min_max)
    return data_min_max_ints

to transform my data, and then I simply took the integer portion to get the final categorization.

def take_int(list_of_float):
    # keep only the integer part of each scaled value
    ints = []
    for flt in list_of_float:
        ints.append(int(flt))
    return ints
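Chained together, the two steps above amount to one linear map followed by truncation. Here is a dependency-free sketch of the same mapping on a hypothetical toy column (same `(name, values)` shape the other functions assume):

```python
def min_max_to_bins(values, num_bins):
    # same map MinMaxScaler(feature_range=(1, num_bins)) applies,
    # followed by truncation to the integer part
    lo, hi = min(values), max(values)
    scale = (num_bins - 1) / (hi - lo)
    return [int(1 + (v - lo) * scale) for v in values]

column = ('x1', [2.0, 3.5, 7.1, 9.9, 10.0])  # hypothetical (name, values) column
print(min_max_to_bins(column[1], 3))         # -> [1, 1, 2, 2, 3]
```

One quirk worth noticing: only the exact maximum ever lands in the top bin, so truncation leaves the last category nearly empty.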

I then also wrote a function that I used to combine this value with the variable name.

def string_transform(prefix, column, index):

    transformed_list = []
    if index < 4:
        # the first few columns (e.g. 'trt', 'y') keep their full prefix
        for entry in column[1]:
            transformed_list.append(prefix + str(entry))
    else:
        # strip the leading 'x' from the column name, so column 'x14'
        # with value 1 becomes '14x1'
        prefix_num = prefix.split('x')
        for entry in column[1]:
            transformed_list.append(str(prefix_num[1]) + 'x' + str(entry))

    return transformed_list

This was done to differentiate variables that have the same value, but appear in different columns. For example, having a value of 1 for variable x14 means something different from getting a value of 1 in variable x20. The string transform function would create 14x1 and 20x1 for the previously mentioned examples.
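Per value, that transformation boils down to the following (a condensed sketch of the function above, with hypothetical inputs):

```python
def tag_value(prefix, value, index):
    # columns before index 4 (e.g. 'trt', 'y') keep their full prefix;
    # for the x-columns, 'x14' with value 1 becomes '14x1'
    if index < 4:
        return prefix + str(value)
    var_number = prefix.split('x')[1]
    return var_number + 'x' + str(value)

print(tag_value('trt', 1, 0))  # -> 'trt1'
print(tag_value('x14', 1, 5))  # -> '14x1'
print(tag_value('x20', 1, 6))  # -> '20x1'
```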

After this, I wrote everything to a file in basket format

import os
import csv

def create_basket(list_of_lists, headers):

    if not os.path.exists('baskets'):
        os.makedirs('baskets')

    down_length = len(list_of_lists[0])

    with open('baskets/dataset.basket', 'w') as basketfile:
        basket_writer = csv.DictWriter(basketfile, fieldnames=headers)

        # headers is ("trt", "y", "x1", ..., "x40"), in the same order as
        # the columns of list_of_lists, so pair them up positionally
        for i in range(down_length):
            basket_writer.writerow(
                {header: list_of_lists[j][i] for j, header in enumerate(headers)})

and I used Orange's apriori-based association rule inducer to see whether there were any association rules.

# Orange 2 API (Python 2): induce rules at 30% support and confidence
rules = Orange.associate.AssociationRulesSparseInducer(patient_basket, support=0.3, confidence=0.3)
print "%4s %4s  %s" % ("Supp", "Conf", "Rule")
for r in rules:

    my_rule = str(r)
    split_rule = my_rule.split("->")

    # only report rules whose consequent involves the treatment variable
    if 'trt' in split_rule[1]:
        print 'treatment rule'
        print "%4.1f %4.1f  %s" % (r.support, r.confidence, r)

Using this technique, I found quite a few association rules in my testing data.

THIS IS WHERE I HAVE A PROBLEM

When I read the notes for the training data, I found this note:

...That is, the only reason for the differences among observed responses to the same treatment across patients is random noise. Hence, there is NO meaningful subgroup for this dataset...

My question is,

why do I get multiple association rules that would imply that there are subgroups, when according to the notes I shouldn't see anything?

I'm getting lift values above 2, as opposed to the value of roughly 1 you would expect if everything were random, as the notes state.
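For reference, lift compares a rule's confidence with the consequent's base rate, so independence gives a value near 1. A minimal sketch (the 35% base rate for `trt1` is a hypothetical number, not taken from the data):

```python
def lift(confidence, consequent_support):
    # lift = P(Y|X) / P(Y) = confidence / support of the consequent
    return confidence / consequent_support

# for a rule like '6x0 -> trt1' with confidence 0.7, a hypothetical
# 35% base rate for trt1 would give a lift of 2
print(lift(0.7, 0.35))  # -> 2.0
```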

Supp Conf  Rule
 0.3  0.7  6x0 -> trt1

Even though my code runs, I'm not getting results anywhere close to what should be expected. This leads me to believe that I messed something up, but I'm not sure what it is.

After some research, I realized that my sample size is too small for the number of variables I have. I would need a much larger sample size to really use this method. In fact, the method I tried to use was developed with the assumption that it would be run on databases with hundreds of thousands, or even millions, of rows.
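This multiple-comparisons effect is easy to reproduce: score every item pair in purely random data and some will clear a lift threshold by chance alone. A self-contained sketch with hypothetical sizes (40 variables, 100 rows, 4 bins each):

```python
import itertools
import random

random.seed(1)
n_rows, n_vars, n_bins = 100, 40, 4

# independent random items by construction: no real subgroups exist
rows = [{f'{v}x{random.randrange(n_bins)}' for v in range(n_vars)}
        for _ in range(n_rows)]

counts = {}
for row in rows:
    for item in row:
        counts[item] = counts.get(item, 0) + 1

# count cross-variable item pairs whose empirical lift exceeds 2 by chance
spurious = 0
for a, b in itertools.combinations(sorted(counts), 2):
    if a.split('x')[0] == b.split('x')[0]:
        continue  # two bins of the same variable can't co-occur meaningfully
    both = sum(1 for r in rows if a in r and b in r)
    lift = (both / n_rows) / ((counts[a] / n_rows) * (counts[b] / n_rows))
    if both >= 5 and lift >= 2:
        spurious += 1

print(spurious > 0)  # chance correlations masquerade as "subgroups"
```

At this sample size, dozens of pairs typically pass; with hundreds of thousands of rows, the same thresholds would filter nearly all of them out.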
