
Why does my association model find subgroups in a dataset when there shouldn't be any?

I give a lot of information on the methods I used to write my code. If you just want to read my question, skip to the quotes at the end.

I'm working on a project whose goal is to detect subpopulations in a group of patients. I thought this sounded like the perfect opportunity to use association rule mining, as I'm currently taking a class on the subject.

There are 42 variables in total. Of those, 20 are continuous and had to be discretized. For each continuous variable, I used the Freedman-Diaconis rule to determine how many categories to divide it into.

def Freedman_Diaconis(column_values):

    # sort the values so the quartiles can be read off by position
    values = sorted(column_values[1])
    n = len(values)

    first_quartile = values[int(n * .25)]
    third_quartile = values[int(n * .75)]
    iqr = third_quartile - first_quartile

    # bin width h = 2 * IQR * n^(-1/3); the exponent must be a float --
    # in Python 2, (-1/3) is integer division and silently evaluates to -1
    h = 2 * iqr * n ** (-1.0 / 3)

    # number of bins = data range / bin width, rounded up
    # (values[0] is the minimum; the original indexed [1] by mistake)
    num_bins = (values[-1] - values[0]) / h
    return int(num_bins + 1)
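As a cross-check on the hand-rolled rule above (an optional sketch, assuming numpy is available), numpy ships the same Freedman-Diaconis estimator under the name 'fd':

```python
import numpy as np

# made-up sample column, just for illustration
rng = np.random.default_rng(0)
values = rng.normal(size=500)

# numpy's 'fd' estimator applies the same Freedman-Diaconis rule
edges = np.histogram_bin_edges(values, bins='fd')
print(len(edges) - 1)  # the suggested number of bins
```

Comparing the two counts on the same column is a quick way to catch arithmetic bugs like the integer-division one above.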

From there, I used min-max normalization

from sklearn import preprocessing

def min_max_transform(column_of_data, num_bins):
    # scale the column into the range [1, num_bins], then truncate
    min_max_normalizer = preprocessing.MinMaxScaler(feature_range=(1, num_bins))
    data_min_max = min_max_normalizer.fit_transform(column_of_data[1])
    return take_int(data_min_max)

to transform my data, and then I simply took the integer portion to get the final category.

def take_int(list_of_float):
    # keep only the integer part of each float
    ints = []
    for flt in list_of_float:
        ints.append(int(flt))
    return ints
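Put together, the two helpers above amount to a linear rescale followed by truncation. A dependency-free sketch of the same arithmetic (MinMaxScaler with feature_range=(1, num_bins) computes exactly this mapping), using made-up values:

```python
def min_max_bin(values, num_bins):
    # map values linearly onto [1, num_bins], then truncate to the
    # integer part -- the same arithmetic as the sklearn pipeline above
    lo, hi = min(values), max(values)
    span = hi - lo
    return [int(1 + (v - lo) / span * (num_bins - 1)) for v in values]

print(min_max_bin([0.1, 0.4, 0.35, 0.8, 0.95], 3))  # → [1, 1, 1, 2, 3]
```

One side effect worth noting: because of the truncation, only the exact column maximum receives the top label, so the highest bin is always very sparse.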

I then also wrote a function that I used to combine this value with the variable name.

def string_transform(prefix, column, index):

    # prepend the (numeric part of the) column name to each bin label
    transformed_list = []
    if index < 4:
        for entry in column[1]:
            transformed_list.append(prefix + str(entry))
    else:
        prefix_num = prefix.split('x')
        for entry in column[1]:
            transformed_list.append(str(prefix_num[1]) + 'x' + str(entry))

    return transformed_list

This was done to differentiate variables that have the same value, but appear in different columns. For example, having a value of 1 for variable x14 means something different from getting a value of 1 in variable x20. The string transform function would create 14x1 and 20x1 for the previously mentioned examples.
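A minimal illustration of the intended labeling (a hypothetical helper and values, not part of the original code):

```python
def label(column_number, bin_value):
    # build labels like "14x1": column number, 'x', bin number
    return "%dx%d" % (column_number, bin_value)

print(label(14, 1), label(20, 1))  # → 14x1 20x1
```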

After this, I wrote everything to a file in basket format

import csv
import os

def create_basket(list_of_lists, headers):

    if not os.path.exists('baskets'):
        os.makedirs('baskets')

    down_length = len(list_of_lists[0])

    with open('baskets/dataset.basket', 'w') as basketfile:
        basket_writer = csv.DictWriter(basketfile, fieldnames=headers)

        # headers is ("trt", "y", "x1", ..., "x40"), in the same order
        # as the columns of list_of_lists, so each row can be built
        # with a single zip instead of 42 hand-written entries
        for i in range(down_length):
            row = {header: column[i]
                   for header, column in zip(headers, list_of_lists)}
            basket_writer.writerow(row)

and I used the apriori package in Orange to see if there were any association rules.

rules = Orange.associate.AssociationRulesSparseInducer(patient_basket, support=0.3, confidence=0.3)
print "%4s %4s  %s" % ("Supp", "Conf", "Rule")
for r in rules:

    my_rule = str(r)
    split_rule = my_rule.split("->")

    if 'trt' in split_rule[1]:
        print 'treatment rule'
        print "%4.1f %4.1f  %s" % (r.support, r.confidence, r)

Using this technique, I found quite a few association rules in my testing data.

THIS IS WHERE I HAVE A PROBLEM

When I read the notes for the training data, there is this note:

...That is, the only reason for the differences among observed responses to the same treatment across patients is random noise. Hence, there is NO meaningful subgroup for this dataset...

My question is:

why do I get multiple association rules that imply there are subgroups, when according to the notes I shouldn't see any?

I'm getting lift values above 2, as opposed to the 1 you would expect if everything were random, as the notes state.

Supp Conf  Rule
 0.3  0.7  6x0 -> trt1

Even though my code runs, I'm not getting results anywhere close to what should be expected. This leads me to believe that I messed something up, but I'm not sure what it is.

After some research, I realized that my sample size is too small for the number of variables that I have. I would need a much larger sample size in order to really use the method I was using. In fact, the method I tried to use was developed with the assumption that it would be run on databases with hundreds of thousands, or even millions, of rows.
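To see why, here is a hypothetical simulation (not the original dataset): with a small number of rows and 42 binary variables, even purely random data yields variable pairs whose estimated lift is well above 1, simply because so many pairs are being tested:

```python
import itertools
import random

random.seed(1)
n_rows, n_vars = 50, 42  # made-up sizes, for illustration only
data = [[random.randint(0, 1) for _ in range(n_vars)]
        for _ in range(n_rows)]

def lift(a, b):
    # lift = P(a and b) / (P(a) * P(b)); 1.0 means independence
    p_a = sum(row[a] for row in data) / n_rows
    p_b = sum(row[b] for row in data) / n_rows
    p_ab = sum(row[a] and row[b] for row in data) / n_rows
    return p_ab / (p_a * p_b) if p_a and p_b else 0.0

best = max(lift(a, b) for a, b in itertools.combinations(range(n_vars), 2))
print(best)  # well above 1, despite the data being pure noise
```

With only 50 rows, each pairwise probability estimate is noisy, and taking the maximum over 861 pairs all but guarantees a few apparently strong "rules"; with hundreds of thousands of rows the estimates concentrate near their true values and the spurious lift disappears.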
