Distance Between Two Vectors As Columns Of Pandas DataFrame
I have a DataFrame which has two vectors as columns. I want to produce a third column that is the Euclidean distance between the two vectors.
I've been using np.linalg.norm, but I've been getting the following ValueError:
ValueError: Length of values does not match length of index
The following is my DataFrame:
                                             Vectors                                      clusterCenter
0  [-0.56663936, 0.8127105, -3.0935333, 1.2820396...  [-0.1343598546941601, 0.763419086816995, -1.48...
1  [-0.8221095, 1.3501785, -1.7770282, -0.4987612...  [-0.1343598546941601, 0.763419086816995, -1.48...
2  [-0.2715391, 1.1768106, -1.252441, 1.6287287, ...  [-0.1343598546941601, 0.763419086816995, -1.48...
3  [-0.58485925, -0.22501345, -0.9360838, 1.45915...  [-0.1343598546941601, 0.763419086816995, -1.48...
4  [-0.44443423, 1.0936267, -1.628864, 0.4971503,...  [-0.1343598546941601, 0.763419086816995, -1.48...
The following is the error/stack trace:
ValueError Traceback (most recent call last)
<ipython-input-181-f32674f361eb> in <module>
4 # profiles_to_cluster['distanceToCenter'][count] = np.linalg.norm(vectors[count]-
5 # cluster_centers[i])
----> 6 profiles_to_cluster2['Distance'] = np.linalg.norm(profiles_to_cluster2['Vectors'] - profiles_to_cluster2['clusterCenter'])
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
3368 else:
3369 # set column
-> 3370 self._set_item(key, value)
3371
3372 def _setitem_slice(self, key, value):
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/frame.py in _set_item(self, key, value)
3443
3444 self._ensure_valid_index(value)
-> 3445 value = self._sanitize_column(key, value)
3446 NDFrame._set_item(self, key, value)
3447
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
3628
3629 # turn me into an ndarray
-> 3630 value = sanitize_index(value, self.index, copy=False)
3631 if not isinstance(value, (np.ndarray, Index)):
3632 if isinstance(value, list) and len(value) > 0:
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/internals/construction.py in sanitize_index(data, index, copy)
517
518 if len(data) != len(index):
--> 519 raise ValueError('Length of values does not match length of index')
520
521 if isinstance(data, ABCIndexClass) and not copy:
ValueError: Length of values does not match length of index
You can do something like this.
>>> import numpy as np
>>> import pandas as pd
>>> x = pd.DataFrame(data=[[[1, 2, 3, 4], [4, 3, 2, 1]],
...                        [[5, 6, 7, 8], [1, 2, 3, 4]]])
>>> (x[0].apply(np.array) - x[1].apply(np.array)).apply(np.linalg.norm)
0    4.472136
1    8.000000
dtype: float64
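Applied to column names like the ones in the question, the same row-wise idea might look like the sketch below. The toy values are made up for illustration, and it assumes that the two cells in each row hold lists or arrays of the same length.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({
    "Vectors": [[1, 2, 3, 4], [5, 6, 7, 8]],
    "clusterCenter": [[4, 3, 2, 1], [1, 2, 3, 4]],
})

# Pair the two cells of each row and compute one Euclidean distance per row,
# so the result has exactly as many values as the DataFrame has rows.
df["Distance"] = [
    np.linalg.norm(np.asarray(v) - np.asarray(c))
    for v, c in zip(df["Vectors"], df["clusterCenter"])
]

print(df["Distance"])
```

Because the list comprehension yields one scalar per row, the assignment lines up with the index of the DataFrame.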
However, your data format and this method are an abuse of pandas, which is built to handle panel data, hence its name. I suggest making two separate DataFrames, each of which has one column for each dimension of your vectors. Then you can simply subtract the two DataFrames and apply np.linalg.norm to each row. Like this:
>>> # first column as separate DataFrame
>>> x = pd.DataFrame(data=[[1, 2, 3, 4], [5, 6, 7, 8]])
>>> # second column as separate DataFrame
>>> y = pd.DataFrame(data=[[4, 3, 2, 1], [1, 2, 3, 4]])
>>> np.linalg.norm(x - y, axis=1)
array([4.47213595, 8.        ])
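If the data already sits in list-valued columns, one possible way to get to that wide layout is sketched here. The toy values are hypothetical, and it assumes every vector has the same number of dimensions so that each list expands into the same set of columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({
    "Vectors": [[1, 2, 3, 4], [5, 6, 7, 8]],
    "clusterCenter": [[4, 3, 2, 1], [1, 2, 3, 4]],
})

# Expand each list-valued column into a wide DataFrame:
# one row per original row, one column per vector dimension.
x = pd.DataFrame(df["Vectors"].tolist(), index=df.index)
y = pd.DataFrame(df["clusterCenter"].tolist(), index=df.index)

# axis=1 takes the norm across columns, giving one distance per row.
distance = pd.Series(
    np.linalg.norm((x - y).to_numpy(), axis=1),
    index=df.index,
    name="Distance",
)

print(distance)
```

Keeping the original index on x, y, and the result means the distances can be assigned back to the source DataFrame without alignment surprises.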