Distance between two vectors stored as columns of a pandas DataFrame

Question

I have a DataFrame with two columns, each of which holds one vector per row. I want to produce a third column containing the Euclidean distance between the two vectors in each row.

I've been using np.linalg.norm, but I keep getting the following ValueError:

ValueError: Length of values does not match length of index

The following is my DataFrame:

Vectors clusterCenter
0   [-0.56663936, 0.8127105, -3.0935333, 1.2820396...   [-0.1343598546941601, 0.763419086816995, -1.48...
1   [-0.8221095, 1.3501785, -1.7770282, -0.4987612...   [-0.1343598546941601, 0.763419086816995, -1.48...
2   [-0.2715391, 1.1768106, -1.252441, 1.6287287, ...   [-0.1343598546941601, 0.763419086816995, -1.48...
3   [-0.58485925, -0.22501345, -0.9360838, 1.45915...   [-0.1343598546941601, 0.763419086816995, -1.48...
4   [-0.44443423, 1.0936267, -1.628864, 0.4971503,...   [-0.1343598546941601, 0.763419086816995, -1.48...

The following is the error/stack trace:

ValueError                                Traceback (most recent call last)
<ipython-input-181-f32674f361eb> in <module>
      4 #    profiles_to_cluster['distanceToCenter'][count] = np.linalg.norm(vectors[count]-
      5 #                                                                cluster_centers[i])
----> 6 profiles_to_cluster2['Distance'] = np.linalg.norm(profiles_to_cluster2['Vectors'] - profiles_to_cluster2['clusterCenter'])

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
   3368         else:
   3369             # set column
-> 3370             self._set_item(key, value)
   3371 
   3372     def _setitem_slice(self, key, value):

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/frame.py in _set_item(self, key, value)
   3443 
   3444         self._ensure_valid_index(value)
-> 3445         value = self._sanitize_column(key, value)
   3446         NDFrame._set_item(self, key, value)
   3447 

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
   3628 
   3629             # turn me into an ndarray
-> 3630             value = sanitize_index(value, self.index, copy=False)
   3631             if not isinstance(value, (np.ndarray, Index)):
   3632                 if isinstance(value, list) and len(value) > 0:

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/internals/construction.py in sanitize_index(data, index, copy)
    517 
    518     if len(data) != len(index):
--> 519         raise ValueError('Length of values does not match length of index')
    520 
    521     if isinstance(data, ABCIndexClass) and not copy:

ValueError: Length of values does not match length of index

Answer 1

You can do something like this:

>>> import numpy as np
>>> import pandas as pd
>>> x = pd.DataFrame(data=[[[1, 2, 3, 4], [4, 3, 2, 1]],
...                        [[5, 6, 7, 8], [1, 2, 3, 4]]])

>>> (x[0].apply(np.array) - x[1].apply(np.array)).apply(np.linalg.norm)
0    4.472136
1    8.000000
dtype: float64

However, your data format and this method are an abuse of pandas, which was built to handle panel data (hence its name). I suggest making two separate DataFrames, each with one column per dimension of your vectors. Then you can simply subtract one from the other and apply np.linalg.norm to each row by passing axis=1. Like this:

>>> # first column as separate DataFrame
>>> x = pd.DataFrame(data=[[1, 2, 3, 4], [5, 6, 7, 8]])
>>> # second column as separate DataFrame
>>> y = pd.DataFrame(data=[[4, 3, 2, 1], [1, 2, 3, 4]])
>>> np.linalg.norm(x - y, axis=1)
array([4.47213595, 8.        ])
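If you would rather keep the DataFrame from the question as it is, the same row-wise idea can be applied by first turning each column of vectors into a 2-D array. The following is only a minimal sketch: the toy values are made up, the column names are copied from the question, and it assumes every vector in a column has the same length.

```python
import numpy as np
import pandas as pd

# Toy data in the question's layout: each cell holds a whole vector.
# The values are illustrative only; the column names follow the question.
df = pd.DataFrame({
    "Vectors": [[1, 2, 3, 4], [5, 6, 7, 8]],
    "clusterCenter": [[4, 3, 2, 1], [1, 2, 3, 4]],
})

# Turn each column of vectors into a 2-D array of shape (n_rows, n_dims).
# This assumes all vectors in a column have the same length.
vectors = np.array(df["Vectors"].tolist(), dtype=float)
centers = np.array(df["clusterCenter"].tolist(), dtype=float)

# axis=1 computes one norm per row, so the result has exactly one value
# per DataFrame row and can be assigned as a new column.
df["Distance"] = np.linalg.norm(vectors - centers, axis=1)

print(df["Distance"])
```

The point of axis=1 is that the result has the same length as the DataFrame's index, which is what the column assignment in the question needs.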
