
Pandas correlation matrix to dictionary of unique index/column combinations

I'm working with a dataset in pandas, and I'm currently stuck at this step:

I have a dataframe that looks like this:

     id1   id2   id3   id4
id1   1    0.3   0.5   0.2
id2  0.2    1    0.4   0.7
id3   0    0.5    1    0.8
id4  0.6   0.1    0     1

Here, id<num> refers to the post_id associated with each message. The matrix above is a 2D array of cosine similarity scores that I computed between those messages.

The end goal is to display the posts on a UI and aggregate those that appear similar to each other. For this, I need the score for id1 and id2, for id1 and id3, and so on.

However, I now have a diagonal of 1's (which makes sense, since each post is identical to itself), and every pair appears twice. How can I do this in a better way, without double for loops, and end up with a dataframe like the one below?

What I'm currently doing gives me this:

id1, id1, score
id1, id2, score
id1, id3, score
id1, id4, score
id2, id1, score
id2, id2, score
id2, id3, score
id2, id4, score
id3, id1, score
id3, id2, score
id3, id3, score
id3, id4, score
id4, id1, score
id4, id2, score
id4, id3, score
id4, id4, score

What I want to get is this:

id1, id2, score
id1, id3, score
id1, id4, score
id2, id3, score
id2, id4, score
id3, id4, score

How can I achieve this? Should I reshape the 2D matrix to arrays? I feel like I'm missing something here.

Appreciate your help in this issue.

You could insert NaNs in the lower triangular part of the dataframe (including the diagonal), stack to remove those NaNs, and build a dict from the iterator returned by pd.Series.items (called pd.Series.iteritems before pandas 2.0):

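import numpy as np

a = df.to_numpy(dtype=float, copy=True)     # writable float copy of the matrix
a[np.tril_indices(a.shape[0], 0)] = np.nan  # blank the diagonal and everything below it
df[:] = a # or directly create a new one -> pd.Dat...(a, index=df.index...)
# dropna() is a no-op where stack already drops NaN; items() replaces the removed iteritems()
out = dict(df.stack().dropna().items())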

print(out)
{('id1', 'id2'): 0.3,
 ('id1', 'id3'): 0.5,
 ('id1', 'id4'): 0.2,
 ('id2', 'id3'): 0.4,
 ('id2', 'id4'): 0.7,
 ('id3', 'id4'): 0.8}

Note that with

a = df.to_numpy(dtype=float, copy=True)
a[np.tril_indices(a.shape[0], 0)] = np.nan

you're discarding the diagonal (each post compared with itself) and the lower triangle (the same pairs in reversed order):

print(a)

[[nan 0.3 0.5 0.2]
 [nan nan 0.4 0.7]
 [nan nan nan 0.8]
 [nan nan nan nan]]

Updating the dataframe with this array and then stacking leaves only the combinations of interest:

df[:] = a
df.stack()

id1  id2    0.3
     id3    0.5
     id4    0.2
id2  id3    0.4
     id4    0.7
id3  id4    0.8
dtype: float64 
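
The same masking idea can also be written without modifying df in place. The following is a minimal sketch, assuming the matrix from the question; DataFrame.where with an upper-triangle mask is swapped in for the in-place assignment, and the column names post_a, post_b and score are illustrative choices, not something from the question.

```python
import numpy as np
import pandas as pd

# The similarity matrix from the question.
ids = ["id1", "id2", "id3", "id4"]
df = pd.DataFrame(
    [[1.0, 0.3, 0.5, 0.2],
     [0.2, 1.0, 0.4, 0.7],
     [0.0, 0.5, 1.0, 0.8],
     [0.6, 0.1, 0.0, 1.0]],
    index=ids, columns=ids,
)

# Boolean mask that is True strictly above the diagonal (k=1).
upper = np.triu(np.ones(df.shape, dtype=bool), k=1)

# where() keeps values where the mask is True and puts NaN elsewhere,
# returning a new frame so df itself is left untouched.
# dropna() makes the result independent of whether stack() drops NaN itself.
pairs = df.where(upper).stack().dropna()

# Dictionary keyed by (row label, column label).
out = dict(pairs.items())

# Long-format frame; the column names here are illustrative.
long_df = pairs.rename_axis(["post_a", "post_b"]).reset_index(name="score")
print(long_df)
```

Because the mask is built from df.shape, this works for any square matrix whose index and columns are in the same order.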

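You can try stack:

out = (
    df.stack()
      .reset_index()
      .rename(columns={"level_0": "col1", "level_1": "col2", 0: "score"})
)
out = out[out.col1 != out.col2]  # drop the diagonal (a post compared with itself)

Output (note that both orderings of each pair, e.g. id1/id2 and id2/id1, are kept):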

print(out)
#    col1 col2  score
# 1   id1  id2    0.3
# 2   id1  id3    0.5
# 3   id1  id4    0.2
# 4   id2  id1    0.2
# 6   id2  id3    0.4
# 7   id2  id4    0.7
# 8   id3  id1    0.0
# 9   id3  id2    0.5
# 11  id3  id4    0.8
# 12  id4  id1    0.6
# 13  id4  id2    0.1
# 14  id4  id3    0.0

Explanation:

  • Use stack to stack all values in one column:
print(df.stack())
# id1  id1    1.0
#      id2    0.3
#      id3    0.5
#      id4    0.2
# id2  id1    0.2
#      id2    1.0
#      id3    0.4
#      id4    0.7
# id3  id1    0.0
#      id2    0.5
#      id3    1.0
#      id4    0.8
# id4  id1    0.6
#      id2    0.1
#      id3    0.0
#      id4    1.0
  • Turn the two index levels into regular columns using reset_index
  • Rename the columns using rename
  • Filter the output to remove rows where col1 equals col2
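
The steps above keep both orderings of every pair. If only one row per pair is wanted, as in the question, one possible variation is to compare label positions rather than label equality. This is a sketch under the assumption that the index and the columns list the same ids in the same order; pos and unique_pairs are names introduced here for illustration.

```python
import pandas as pd

# The similarity matrix from the question.
ids = ["id1", "id2", "id3", "id4"]
df = pd.DataFrame(
    [[1.0, 0.3, 0.5, 0.2],
     [0.2, 1.0, 0.4, 0.7],
     [0.0, 0.5, 1.0, 0.8],
     [0.6, 0.1, 0.0, 1.0]],
    index=ids, columns=ids,
)

# Stack into long format with named columns.
out = (
    df.stack()
      .rename_axis(["col1", "col2"])
      .reset_index(name="score")
)

# Position of each label in the matrix. Comparing positions instead of the
# labels themselves avoids relying on string order (e.g. "id10" < "id2").
pos = {label: i for i, label in enumerate(df.index)}

# Keep only rows strictly above the diagonal: one row per unordered pair.
unique_pairs = out[out["col1"].map(pos) < out["col2"].map(pos)].reset_index(drop=True)
print(unique_pairs)
```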

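Try this:

1) multiply the dataframe by a triangular 0/1 mask such as numpy.tril(numpy.ones(df.shape)) - adjust the k parameter (or use numpy.triu) to zero the unwanted part of the table, including the diagonal

2) pandas.melt() the data frame to the desired format

3) filter out rows with zeros (note that this also drops pairs whose real score is 0, such as id3/id1 in the example)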
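
These three steps could be sketched as follows. This is an illustration rather than the answerer's exact code: it masks with NaN instead of multiplying by zeros so that genuine zero scores survive the final filter, it keeps the upper triangle to match the output requested in the question, and the names row_id, col_id and score are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# The similarity matrix from the question.
ids = ["id1", "id2", "id3", "id4"]
df = pd.DataFrame(
    [[1.0, 0.3, 0.5, 0.2],
     [0.2, 1.0, 0.4, 0.7],
     [0.0, 0.5, 1.0, 0.8],
     [0.6, 0.1, 0.0, 1.0]],
    index=ids, columns=ids,
)

# Step 1: blank out the diagonal and the lower triangle (NaN instead of 0).
mask = np.triu(np.ones(df.shape, dtype=bool), k=1)
masked = df.where(mask)

# Step 2: melt to long format; the row labels become a regular column first.
# Step 3: drop the masked cells, which are NaN rather than 0.
long_df = (
    masked.rename_axis("row_id")
          .reset_index()
          .melt(id_vars="row_id", var_name="col_id", value_name="score")
          .dropna(subset=["score"])
          .reset_index(drop=True)
)
print(long_df)
```

melt walks the matrix column by column, so the rows come out in a different order than with stack; sort on row_id and col_id if the order matters.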
