Calculating the unique ID overlap between two strings

Question

I have a data set with two columns. The first column contains unique user IDs and the second column contains attributes connected to these IDs.

For example:

------------------------
User ID     Attribute
------------------------
1234        blond
1235        brunette
1236        blond   
1234        tall
1235        tall
1236        short
------------------------

What I want to know is the correlation between attributes, that is, how often two attributes occur together for the same user. In the example above, I want to know how many times a blond is also tall. My desired output is:

------------------------------
Attr 1     Attr 2     Overlap
------------------------------
blond       tall         1
blond       short        1
brunette    tall         1
brunette    short        0
------------------------------

I tried using pandas to pivot the data and get the output, but as my data set has hundreds of attributes, my current attempt is not feasible.

import pandas

df = pandas.read_csv('myfile.csv')

df.pivot_table(index='User ID', columns='Attribute', aggfunc=len, fill_value=0)

My current output:

--------------------------------
Blond   Brunette   Short   Tall
--------------------------------
  0        1         0       1
  1        0         0       1
  1        0         1       0 
--------------------------------

Is there a way to get the output I want? Thanks in advance.

Answer 1

You could use itertools.product to enumerate every possible pair of attributes, and then match the rows on each pair:

import pandas as pd
from itertools import product

# 1) creating pandas dataframe
df = [  ["1234"    ,    "blond"],
        ["1235"    ,    "brunette"],
        ["1236"    ,    "blond"   ],
        ["1234"    ,    "tall"],
        ["1235"    ,    "tall"],
        ["1236"    ,    "short"]]

df = pd.DataFrame(df)
df.columns = ["id", "attribute"]

# 2) iterating over all ordered pairs of attributes
attributs = set(df.attribute)
for attribut1, attribut2 in product(attributs, attributs):
    if attribut1 != attribut2:
        # 3) selecting the ids of the rows that carry each attribute
        df1 = df[df.attribute == attribut1]["id"]
        df2 = df[df.attribute == attribut2]["id"]
        # 4) counting the ids that have both attributes
        intersection = len(set(df1).intersection(set(df2)))
        if intersection:
            # 5) displaying the number of matches
            print(attribut1, attribut2, intersection)

This gives (the order of the rows may vary, because sets are unordered):

tall brunette 1
tall blond 1
brunette tall 1
blond tall 1
blond short 1
short blond 1
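
With hundreds of attributes, filtering the DataFrame twice for every pair gets slow. A possible variation, offered only as a sketch, builds the set of IDs for each attribute once and then intersects the sets. The names ids_by_attribute and overlaps are made up for this illustration, and itertools.combinations is swapped in for product so that each unordered pair appears only once.

```python
from itertools import combinations

import pandas as pd

# Sample data from the question; the column names follow this answer.
df = pd.DataFrame(
    [["1234", "blond"], ["1235", "brunette"], ["1236", "blond"],
     ["1234", "tall"], ["1235", "tall"], ["1236", "short"]],
    columns=["id", "attribute"],
)

# Build the set of ids for every attribute once (hypothetical helper name).
ids_by_attribute = df.groupby("attribute")["id"].apply(set).to_dict()

# combinations() yields every unordered pair of attributes exactly once.
overlaps = {
    (a, b): len(ids_by_attribute[a] & ids_by_attribute[b])
    for a, b in combinations(sorted(ids_by_attribute), 2)
}

# Print only the pairs that share at least one id.
for (a, b), count in overlaps.items():
    if count:
        print(a, b, count)
```

The number of pairs still grows quadratically with the number of attributes, but each pair now costs a single set intersection instead of two scans of the DataFrame.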

EDIT

It is then easy to refine this to get your desired output:

import pandas as pd
from itertools import product

# 1) creating pandas dataframe
df = [  ["1234"    ,    "blond"],
        ["1235"    ,    "brunette"],
        ["1236"    ,    "blond"   ],
        ["1234"    ,    "tall"],
        ["1235"    ,    "tall"],
        ["1236"    ,    "short"]]

df = pd.DataFrame(df)
df.columns = ["id", "attribute"]

wanted_attribute_1 = ["blond", "brunette"]

# 2) iterating over all ordered pairs of attributes
attributs = set(df.attribute)
for attribut1, attribut2 in product(attributs, attributs):
    # keep only pairs whose first attribute is wanted and whose second is not
    if attribut1 in wanted_attribute_1 and attribut2 not in wanted_attribute_1:
        if attribut1 != attribut2:
            # 3) selecting the ids of the rows that carry each attribute
            df1 = df[df.attribute == attribut1]["id"]
            df2 = df[df.attribute == attribut2]["id"]
            # 4) counting the ids that have both attributes
            intersection = len(set(df1).intersection(set(df2)))
            # 5) displaying the number of matches, including zeros
            print(attribut1, attribut2, intersection)

This gives (the order of the rows may vary, because sets are unordered):

brunette tall 1
brunette short 0
blond tall 1
blond short 1
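
If the result is wanted as a DataFrame shaped like the table in the question, one option is a self-join on the id column. The sketch below makes a few assumptions: the attributes are split by hand into two hypothetical lists, hair and height, and duplicate (id, attribute) rows are dropped first so that every match counts a unique ID.

```python
import pandas as pd

# Sample data from the question; the column names follow this answer.
df = pd.DataFrame(
    [["1234", "blond"], ["1235", "brunette"], ["1236", "blond"],
     ["1234", "tall"], ["1235", "tall"], ["1236", "short"]],
    columns=["id", "attribute"],
).drop_duplicates()

# Hypothetical grouping of the attributes, chosen by hand for this example.
hair = ["blond", "brunette"]
height = ["tall", "short"]

# Self-join on id: one row per pair of attributes held by the same user.
pairs = df.merge(df, on="id", suffixes=("_1", "_2"))
pairs = pairs[pairs["attribute_1"].isin(hair) & pairs["attribute_2"].isin(height)]

# Count users per pair, then add the pairs that never occur with a count of 0.
counts = pairs.groupby(["attribute_1", "attribute_2"]).size()
all_pairs = pd.MultiIndex.from_product([hair, height], names=["Attr 1", "Attr 2"])
result = counts.reindex(all_pairs, fill_value=0).rename("Overlap").reset_index()

print(result)
```

The reindex step is what keeps the brunette/short row with an overlap of 0, which a plain groupby would omit.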

Answer 2

From your pivoted table, you can compute the matrix product of its transpose with itself, which counts for every pair of attributes the users that have both, and then convert the upper triangular part of the result to long format:

import pandas as pd
import numpy as np

# df is the DataFrame from the question, with columns 'User ID' and 'Attribute'
mat = df.pivot_table(index='User ID', columns='Attribute', aggfunc=len, fill_value=0)

# attribute-by-attribute co-occurrence counts (the "tcrossprod")
tprod = mat.T.dot(mat)
# keep only the strictly upper triangular part, then stack to long format
upper = np.triu(np.ones(tprod.shape, dtype=bool), 1)
result = tprod.where(upper, np.nan).stack().rename('value')
result.index.names = ['Attr1', 'Attr2']
result.reset_index().sort_values('value', ascending=False)
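
For reference, here is a self-contained sketch of the same idea on the sample data. It departs from the code above in two ways that are assumptions of this example rather than part of the answer: pd.crosstab is swapped in for pivot_table to build the 0/1 indicator matrix, and the upper triangle is read out with np.triu_indices instead of where/stack.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    [[1234, "blond"], [1235, "brunette"], [1236, "blond"],
     [1234, "tall"], [1235, "tall"], [1236, "short"]],
    columns=["User ID", "Attribute"],
)

# Indicator matrix: one row per user, one column per attribute.
# clip(upper=1) guards against duplicated (user, attribute) rows.
mat = pd.crosstab(df["User ID"], df["Attribute"]).clip(upper=1)

# co.loc[a, b] is the number of users that have both attribute a and b.
co = mat.T.dot(mat)

# Read each unordered pair once from the strictly upper triangle.
rows, cols = np.triu_indices(len(co), k=1)
result = pd.DataFrame({
    "Attr1": co.index.to_numpy()[rows],
    "Attr2": co.columns.to_numpy()[cols],
    "Overlap": co.to_numpy()[rows, cols],
}).sort_values("Overlap", ascending=False).reset_index(drop=True)

print(result)
```

The diagonal of co holds the number of users per attribute, which is why only the part above the diagonal is kept.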
