I want to calculate the cosine distance (with scipy) between two vectors. I originally have a DataFrame with a 'category' and a 'value' for each person.
I want to calculate the distance between persons, using for each person the vector of 'value' entries indexed by 'category'.
import pandas as pd
from scipy.spatial.distance import cosine
d = {'person': ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
     'category': ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
     'value': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(d)
  category person  value
0        A      1      1
1        B      1      1
2        C      1      1
3        B      2      1
4        D      2      1
5        E      3      1
6        F      3      1
7        F      4      1
8        D      4      1
I can do this by creating a pivot table like this:
pivot = df.pivot_table(index=['person'], columns='category', values='value', aggfunc='sum', fill_value=0)
index  person  A  B  C  D  E  F
0      1       1  1  1  0  0  0
1      2       0  1  0  1  0  0
2      3       0  0  0  0  1  1
3      4       0  0  0  1  0  1
However, I do not want to do this (I am dealing with big vectors so pd.pivot_table can take a while).
How can I do this using the original 'sparse' format in df?
Try this:
In [30]: pd.crosstab(df.person, df.category).reset_index().rename_axis(None, axis=1)
Out[30]:
  person  A  B  C  D  E  F
0      1  1  1  1  0  0  0
1      2  0  1  0  1  0  0
2      3  0  0  0  0  1  1
3      4  0  0  0  1  0  1
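Note that crosstab still builds a dense person-by-category table. If that table is the bottleneck, one possible way to stay close to the original long format is to build a scipy.sparse matrix straight from df and compute the cosine distances on it. This is only a sketch on the toy data from the question, not something benchmarked on large inputs. The names row_codes, col_codes, mat, unit and dist are mine, and it assumes every person has at least one non-zero value (otherwise the row norm is zero and the division fails).

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, diags
from scipy.spatial.distance import cosine

d = {'person': ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
     'category': ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
     'value': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(d)

# Map every person / category label to an integer row / column position.
row_codes, persons = pd.factorize(df['person'], sort=True)
col_codes, categories = pd.factorize(df['category'], sort=True)

# Build the person x category matrix without materialising the zeros.
# csr_matrix sums duplicate (row, col) pairs, similar to aggfunc='sum'.
mat = csr_matrix(
    (df['value'].to_numpy(dtype=float), (row_codes, col_codes)),
    shape=(len(persons), len(categories)),
)

# Scale each row to unit length (assumes no all-zero rows).
norms = np.sqrt(np.asarray(mat.multiply(mat).sum(axis=1)).ravel())
unit = diags(1.0 / norms) @ mat

# Cosine similarity is the dot product of unit rows; distance is 1 - similarity.
# The product stays sparse; toarray() is only reasonable for few persons.
dist = 1.0 - (unit @ unit.T).toarray()

# Spot check against scipy's dense cosine for persons '1' and '2'.
check = cosine([1, 1, 1, 0, 0, 0], [0, 1, 0, 1, 0, 0])
print(pd.DataFrame(dist, index=persons, columns=persons).round(3))
print(np.isclose(dist[0, 1], check))
```

Here dist[i, j] is the cosine distance between persons[i] and persons[j]. With many persons you would probably not call toarray() on the full product, but instead work with the sparse result of unit @ unit.T or compute only the pairs you need.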