Calculate distance between sparse vectors

I want to calculate the cosine distance (using scipy) between two vectors. I start with a DataFrame that holds a 'category' and a 'value' for each person.

I want to calculate the distance between persons, where each person's vector consists of the 'value' entries indexed by 'category'.

import pandas as pd
from scipy.spatial.distance import cosine

d = {'person': ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
     'category': ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
     'value': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

df = pd.DataFrame(d)

  category person  value
0        A      1      1
1        B      1      1
2        C      1      1
3        B      2      1
4        D      2      1
5        E      3      1
6        F      3      1
7        F      4      1
8        D      4      1

I can do this by creating a pivot table like this:

pivot = df.pivot_table(index=['person'], columns='category', values='value', aggfunc='sum', fill_value=0)

index person  A  B  C  D  E  F
0          1  1  1  1  0  0  0
1          2  0  1  0  1  0  0
2          3  0  0  0  0  1  1
3          4  0  0  0  1  0  1

However, I do not want to do this (I am dealing with big vectors so pd.pivot_table can take a while).

How can I do this using the original 'sparse' format in df?

Try this:

In [30]: pd.crosstab(df.person, df.category).reset_index().rename_axis(None, axis=1)
Out[30]:
  person  A  B  C  D  E  F
0      1  1  1  1  0  0  0
1      2  0  1  0  1  0  0
2      3  0  0  0  0  1  1
3      4  0  0  0  1  0  1
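
Note that pd.crosstab still builds a dense person-by-category table, and it counts occurrences rather than summing 'value' (the two agree here only because every value is 1). If the goal is to stay in the sparse format, one possible approach is to build a scipy.sparse matrix directly from the long-format columns and compute the cosine distance from it. The following is a minimal sketch under that assumption, not part of the original answer; the helper name cosine_distance and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

d = {'person': ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
     'category': ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
     'value': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(d)

# Map every person / category label to an integer row / column position.
row_codes, persons = pd.factorize(df['person'], sort=True)
col_codes, categories = pd.factorize(df['category'], sort=True)

# Sparse person-by-category matrix. Duplicate (person, category) pairs
# are summed, which mirrors aggfunc='sum' in the pivot table.
X = csr_matrix(
    (df['value'].to_numpy(dtype=float), (row_codes, col_codes)),
    shape=(len(persons), len(categories)),
)

# Euclidean norm of each person's vector, computed without densifying X.
norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())


def cosine_distance(i, j):
    """Cosine distance between rows i and j of X (hypothetical helper).

    Assumes both rows have a non-zero norm.
    """
    dot = X[i].multiply(X[j]).sum()
    return 1.0 - dot / (norms[i] * norms[j])


# Look up row positions from the original person labels.
i, j = persons.get_loc('1'), persons.get_loc('2')
print(cosine_distance(i, j))
```

Only the non-zero entries are stored, so memory use grows with the number of rows in df rather than with persons times categories. For all pairs at once, sklearn.metrics.pairwise.cosine_distances also accepts a sparse matrix, though its result is a dense persons-by-persons array.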
