I have a dataframe in which each row is a sample and each column is a feature. I would like to calculate the mean row of the dataframe and then the Euclidean distance from each sample to that mean.
For example:
df = pd.DataFrame(np.random.randn(10, 3), columns=[1, 2, 3])
For the dataframe above, I would first like to compute the mean row, which in this example is a (1, 3) mean_array. Next, I would like the distance from each of the 10 samples to that mean, which gives one value per sample (10 values in total).
How can I do this in a simple way?
Use df.mean() to calculate the centroid. Then things are straightforward:
dist_to_centroid = np.sqrt((df - df.mean())**2).sum(axis=1)
Output:
0 1.614658
1 4.234299
2 0.665248
3 3.649749
4 2.828436
5 2.281306
6 3.493792
7 4.165420
8 2.299944
9 2.793936
dtype: float64
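I think the answer from @Quang Hoang is incorrect. It computes the Manhattan distance, whereas the OP asked for the Euclidean distance: taking the sqrt of each squared difference just yields its absolute value, so summing afterwards adds up absolute differences. Just change the placement of the brackets so that the sqrt encapsulates the sum.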
dist_to_centroid = np.sqrt(((df - df.mean())**2).sum(axis=1))
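To make the difference between the two bracket placements concrete, here is a minimal, self-contained sketch. The seeded generator, the 10 x 3 shape and the variable names (centroid, diff, euclidean, manhattan) are assumptions chosen for illustration, not part of the original question. It cross-checks the corrected expression against np.linalg.norm, and the first answer's expression against a plain sum of absolute differences.

```python
import numpy as np
import pandas as pd

# Illustrative data: 10 samples, 3 features (seed chosen arbitrarily).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 3)), columns=[1, 2, 3])

centroid = df.mean()   # column-wise mean: one value per feature
diff = df - centroid   # centroid is aligned to the columns and subtracted from every row

# sqrt applied AFTER summing the squares: Euclidean (L2) distance
euclidean = np.sqrt((diff ** 2).sum(axis=1))

# sqrt applied BEFORE summing: sqrt(x**2) == |x|, so this is Manhattan (L1)
manhattan = np.sqrt(diff ** 2).sum(axis=1)

# Cross-checks against independent formulations
assert np.allclose(euclidean, np.linalg.norm(diff.to_numpy(), axis=1))
assert np.allclose(manhattan, diff.abs().sum(axis=1))
```

Both results are Series with one value per sample. The Manhattan value is never smaller than the Euclidean one for the same row, which is a quick way to tell which of the two a given output represents.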