
Why does my association model find subgroups in a dataset when there shouldn't be any?

I give a lot of information on the methods I used to write my code. If you just want to read my question, skip to the quotes at the end.

I'm working on a project whose goal is to detect subpopulations in a group of patients. I thought this sounded like the perfect opportunity to use association rule mining, as I'm currently taking a class on the subject.

There are 42 variables in total. Of those, 20 are continuous and had to be discretized. For each continuous variable, I used the Freedman-Diaconis rule to determine how many categories to divide it into.

def Freedman_Diaconis(column_values):

    # sort the values so the quartiles can be read off by position
    values = sorted(column_values[1])
    n = len(values)

    first_quartile = values[int(n * .25)]
    third_quartile = values[int(n * .75)]
    iqr = third_quartile - first_quartile

    # bin width h = 2 * IQR * n^(-1/3); the exponent must be a float --
    # in Python 2, (-1/3) is integer division and silently evaluates to -1
    h = 2 * iqr * n ** (-1.0 / 3)

    # number of bins = data range / bin width, rounded up
    # (values[0] is the minimum; the original indexed [1] by mistake)
    num_bins = (values[-1] - values[0]) / h
    return int(num_bins + 1)
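As a cross-check on the hand-rolled rule above (an optional sketch, assuming numpy is available), numpy ships the same Freedman-Diaconis estimator under the name 'fd':

```python
import numpy as np

# made-up sample column, just for illustration
rng = np.random.default_rng(0)
values = rng.normal(size=500)

# numpy's 'fd' estimator applies the same Freedman-Diaconis rule
edges = np.histogram_bin_edges(values, bins='fd')
print(len(edges) - 1)  # the suggested number of bins
```

Comparing the two counts on the same column is a quick way to catch arithmetic bugs like the integer-division one above.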

From there, I used min-max normalization

from sklearn import preprocessing

def min_max_transform(column_of_data, num_bins):
    # scale the column into the range [1, num_bins], then truncate
    min_max_normalizer = preprocessing.MinMaxScaler(feature_range=(1, num_bins))
    data_min_max = min_max_normalizer.fit_transform(column_of_data[1])
    return take_int(data_min_max)

to transform my data, and then I simply took the integer portion to get the final category.

def take_int(list_of_float):
    # keep only the integer part of each float
    ints = []
    for flt in list_of_float:
        ints.append(int(flt))
    return ints
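Put together, the two helpers above amount to a linear rescale followed by truncation. A dependency-free sketch of the same arithmetic (MinMaxScaler with feature_range=(1, num_bins) computes exactly this mapping), using made-up values:

```python
def min_max_bin(values, num_bins):
    # map values linearly onto [1, num_bins], then truncate to the
    # integer part -- the same arithmetic as the sklearn pipeline above
    lo, hi = min(values), max(values)
    span = hi - lo
    return [int(1 + (v - lo) / span * (num_bins - 1)) for v in values]

print(min_max_bin([0.1, 0.4, 0.35, 0.8, 0.95], 3))  # → [1, 1, 1, 2, 3]
```

One side effect worth noting: because of the truncation, only the exact column maximum receives the top label, so the highest bin is always very sparse.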

I then also wrote a function that I used to combine this value with the variable name.

def string_transform(prefix, column, index):

    # prepend the (numeric part of the) column name to each bin label
    transformed_list = []
    if index < 4:
        for entry in column[1]:
            transformed_list.append(prefix + str(entry))
    else:
        prefix_num = prefix.split('x')
        for entry in column[1]:
            transformed_list.append(str(prefix_num[1]) + 'x' + str(entry))

    return transformed_list

This was done to differentiate variables that have the same value, but appear in different columns. For example, having a value of 1 for variable x14 means something different from getting a value of 1 in variable x20. The string transform function would create 14x1 and 20x1 for the previously mentioned examples.
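A minimal illustration of the intended labeling (a hypothetical helper and values, not part of the original code):

```python
def label(column_number, bin_value):
    # build labels like "14x1": column number, 'x', bin number
    return "%dx%d" % (column_number, bin_value)

print(label(14, 1), label(20, 1))  # → 14x1 20x1
```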

After this, I wrote everything to a file in basket format

import csv
import os

def create_basket(list_of_lists, headers):

    if not os.path.exists('baskets'):
        os.makedirs('baskets')

    down_length = len(list_of_lists[0])

    with open('baskets/dataset.basket', 'w') as basketfile:
        basket_writer = csv.DictWriter(basketfile, fieldnames=headers)

        # headers is ("trt", "y", "x1", ..., "x40"), in the same order
        # as the columns of list_of_lists, so each row can be built
        # with a single zip instead of 42 hand-written entries
        for i in range(down_length):
            row = {header: column[i]
                   for header, column in zip(headers, list_of_lists)}
            basket_writer.writerow(row)

and I used the apriori package in Orange to see if there were any association rules.

rules = Orange.associate.AssociationRulesSparseInducer(patient_basket, support=0.3, confidence=0.3)
print "%4s %4s  %s" % ("Supp", "Conf", "Rule")
for r in rules:

    my_rule = str(r)
    split_rule = my_rule.split("->")

    if 'trt' in split_rule[1]:
        print 'treatment rule'
        print "%4.1f %4.1f  %s" % (r.support, r.confidence, r)

Using this technique, I found quite a few association rules in my testing data.

THIS IS WHERE I HAVE A PROBLEM

When I read the notes for the training data, there is this note:

...That is, the only reason for the differences among observed responses to the same treatment across patients is random noise. Hence, there is NO meaningful subgroup for this dataset...

My question is:

why do I get multiple association rules that imply there are subgroups, when according to the notes I shouldn't see any?

I'm getting lift values above 2, as opposed to the 1 you would expect if everything were random, as the notes state.

Supp Conf  Rule
 0.3  0.7  6x0 -> trt1

Even though my code runs, I'm not getting results anywhere close to what should be expected. This leads me to believe that I messed something up, but I'm not sure what it is.

After some research, I realized that my sample size is too small for the number of variables that I have. I would need a much larger sample size in order to really use the method I was using. In fact, the method I tried to use was developed with the assumption that it would be run on databases with hundreds of thousands, or even millions, of rows.
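To see why, here is a hypothetical simulation (not the original dataset): with a small number of rows and 42 binary variables, even purely random data yields variable pairs whose estimated lift is well above 1, simply because so many pairs are being tested:

```python
import itertools
import random

random.seed(1)
n_rows, n_vars = 50, 42  # made-up sizes, for illustration only
data = [[random.randint(0, 1) for _ in range(n_vars)]
        for _ in range(n_rows)]

def lift(a, b):
    # lift = P(a and b) / (P(a) * P(b)); 1.0 means independence
    p_a = sum(row[a] for row in data) / n_rows
    p_b = sum(row[b] for row in data) / n_rows
    p_ab = sum(row[a] and row[b] for row in data) / n_rows
    return p_ab / (p_a * p_b) if p_a and p_b else 0.0

best = max(lift(a, b) for a, b in itertools.combinations(range(n_vars), 2))
print(best)  # well above 1, despite the data being pure noise
```

With only 50 rows, each pairwise probability estimate is noisy, and taking the maximum over 861 pairs all but guarantees a few apparently strong "rules"; with hundreds of thousands of rows the estimates concentrate near their true values and the spurious lift disappears.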
