How to apply a defined function to a pandas DataFrame

Question

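I have defined the following function, which operates on a 2D array. The angle function computes the angles between vectors.

When called below, it takes directions as its argument, which is a 2D array (with two columns, one holding the x values and the other the y values).

directions is obtained by applying np.diff() to a 2D array.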
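
As a small illustration of that step, the sketch below shows how np.diff with axis=0 turns N points into N-1 direction vectors. The four sample points are made up for the example and are not the asker's data.

```python
import numpy as np

# Hypothetical path of four 2-D points (illustrative values only).
path = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

# Differences between consecutive rows: one direction vector per segment.
directions = np.diff(path, axis=0)
print(directions.shape)
print(directions)
```

Each row of directions is the step from one point to the next, which is the input the angle function expects.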

import matplotlib.pyplot as plt
import numpy as np
import os
import rdp

def angle(dir):
    """
    Returns the angles between vectors.

    Parameters:
    dir is a 2D-array of shape (N,M) representing N vectors in M-dimensional space.

    The return value is a 1D-array of values of shape (N-1,), with each value between 0 and pi.

    0 implies the vectors point in the same direction
    pi/2 implies the vectors are orthogonal
    pi implies the vectors point in opposite directions
    """
    dir2 = dir[1:]
    dir1 = dir[:-1]
    return np.arccos((dir1*dir2).sum(axis=1)/(np.sqrt((dir1**2).sum(axis=1)*(dir2**2).sum(axis=1))))

tolerance = 70
min_angle = np.pi*0.22

filename = os.path.expanduser('~/tmp/bla.data')
points = np.genfromtxt(filename).T
print(len(points))
x, y = points.T

# Use the Ramer-Douglas-Peucker algorithm to simplify the path
# http://en.wikipedia.org/wiki/Ramer-Douglas-Peucker_algorithm
# Python implementation: https://github.com/sebleier/RDP/
simplified = np.array(rdp.rdp(points.tolist(), tolerance))

print(len(simplified))
sx, sy = simplified.T

# compute the direction vectors on the simplified curve
directions = np.diff(simplified, axis=0)
theta = angle(directions)

# Select the index of the points with the greatest theta
# Large theta is associated with greatest change in direction.
idx = np.where(theta>min_angle)[0]+1

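I want to apply the code above to a pandas.DataFrame containing trajectory data.

Below is a sample df. The sx and sy values that belong to the same subid are considered one trajectory. For example, rows 0-3 share subid 2 and id 11, so they are points on one trajectory. Rows 4-6 form another trajectory, and so on. In other words, a new trajectory starts whenever subid or id changes.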

  id      subid     simplified_points     sx       sy
0 11      2         (3,4)                 3        4
1 11      2         (5,6)                 5        6
2 11      2         (7,8)                 7        8
3 11      2         (9,9)                 9        9
4 11      3         (10,12)               10       12
5 11      3         (12,14)               12       14
6 11      3         (13,15)               13       15
7 12      9         (18,20)               18       20
8 12      9         (22,24)               22       24
9 12      9         (25,27)               25       27
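
One possible way to make the trajectory boundaries explicit is to number each (id, subid) pair. This is only a sketch: the column name traj is hypothetical, and the frame below reproduces just the id and subid columns of the sample.

```python
import pandas as pd

df = pd.DataFrame({
    "id":    [11, 11, 11, 11, 11, 11, 11, 12, 12, 12],
    "subid": [2, 2, 2, 2, 3, 3, 3, 9, 9, 9],
})

# Number each (id, subid) pair; rows sharing a number form one trajectory.
# "traj" is a hypothetical column name chosen for this example.
df["traj"] = df.groupby(["id", "subid"]).ngroup()

# Number of points in each trajectory.
print(df.groupby("traj").size())
```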

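The DataFrame above was obtained after the rdp algorithm had already been applied. simplified_points is the output of the rdp algorithm, further unpacked into the two columns sx and sy.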
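
The unpacking step might look like the sketch below. It assumes simplified_points holds Python tuples rather than strings, which the question does not state, and uses a shortened version of the sample frame.

```python
import pandas as pd

# Hypothetical frame holding (x, y) tuples, mirroring part of the sample above.
df = pd.DataFrame({
    "id": [11, 11, 12],
    "subid": [2, 2, 9],
    "simplified_points": [(3, 4), (5, 6), (18, 20)],
})

# Expand each tuple into two columns, keeping the original row index.
xy = pd.DataFrame(df["simplified_points"].tolist(),
                  index=df.index, columns=["sx", "sy"])
df = pd.concat([df, xy], axis=1)
print(df)
```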

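The problem is obtaining directions for each trajectory, and from that theta and idx. The code above handles only a single trajectory, held in a 2D array, and I have not been able to adapt it to the pandas DataFrame above.

Please suggest a way to apply the code above to each trajectory in df.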

Answer 1

You can use pandas.DataFrame.groupby.apply() to process each (id, subid) group, for example:

Code:

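import numpy as np
import pandas as pd

def theta(group):
    # Differences between consecutive points within one trajectory
    dx = pd.Series(group.sx.diff(), name='dx')
    dy = pd.Series(group.sy.diff(), name='dy')
    # Heading of each segment, measured from the x-axis
    theta = pd.Series(np.arctan2(dy, dx), name='theta')
    return pd.concat([dx, dy, theta], axis=1)

df2 = df.groupby(['id', 'subid']).apply(theta)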

Test code:

from io import StringIO

df = pd.read_fwf(StringIO(u"""
    id      subid     simplified_points     sx       sy
    11      2         (3,4)                 3        4
    11      2         (5,6)                 5        6
    11      2         (7,8)                 7        8
    11      2         (9,9)                 9        9
    11      3         (10,12)               10       12
    11      3         (12,14)               12       14
    11      3         (13,15)               13       15
    12      9         (18,20)               18       20
    12      9         (22,24)               22       24
    12      9         (25,27)               25       27"""),
                 header=1)

df2 = df.groupby(['id', 'subid']).apply(theta)
df = pd.concat([df, pd.DataFrame(df2.values, columns=df2.columns)], axis=1)
print(df)

Result:

   id  subid simplified_points  sx  sy   dx   dy     theta
0  11      2             (3,4)   3   4  NaN  NaN       NaN
1  11      2             (5,6)   5   6  2.0  2.0  0.785398
2  11      2             (7,8)   7   8  2.0  2.0  0.785398
3  11      2             (9,9)   9   9  2.0  1.0  0.463648
4  11      3           (10,12)  10  12  NaN  NaN       NaN
5  11      3           (12,14)  12  14  2.0  2.0  0.785398
6  11      3           (13,15)  13  15  1.0  1.0  0.785398
7  12      9           (18,20)  18  20  NaN  NaN       NaN
8  12      9           (22,24)  22  24  4.0  4.0  0.785398
9  12      9           (25,27)  25  27  3.0  3.0  0.785398
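
Note that the theta column above is the heading of each segment (via np.arctan2), whereas the angle function in the question measures the turn between consecutive segments. If the turning angle and idx are what is needed, a minimal sketch along the following lines may work. The helper name turning_labels and the threshold value are hypothetical, np.clip is added here only to guard against rounding error, and the loop over groups is one choice among several.

```python
import numpy as np
import pandas as pd

def angle(directions):
    """Angles between consecutive direction vectors, as in the question."""
    d1, d2 = directions[:-1], directions[1:]
    cos = (d1 * d2).sum(axis=1) / np.sqrt((d1 ** 2).sum(axis=1) * (d2 ** 2).sum(axis=1))
    # Clip guards against rounding slightly outside [-1, 1].
    return np.arccos(np.clip(cos, -1.0, 1.0))

def turning_labels(group, min_angle):
    """Hypothetical helper: index labels of the points in one trajectory
    where the path turns by more than min_angle."""
    pts = group[["sx", "sy"]].to_numpy(dtype=float)
    theta = angle(np.diff(pts, axis=0))
    idx = np.where(theta > min_angle)[0] + 1
    return group.index[idx]

df = pd.DataFrame({
    "id":    [11, 11, 11, 11, 11, 11, 11, 12, 12, 12],
    "subid": [2, 2, 2, 2, 3, 3, 3, 9, 9, 9],
    "sx":    [3, 5, 7, 9, 10, 12, 13, 18, 22, 25],
    "sy":    [4, 6, 8, 9, 12, 14, 15, 20, 24, 27],
})

min_angle = 0.1  # illustrative threshold, not the question's np.pi * 0.22

# Collect the turning points trajectory by trajectory.
labels = []
for _, g in df.groupby(["id", "subid"]):
    labels.extend(turning_labels(g, min_angle))

result = df.loc[labels]
print(result)
```

With the sample data, only the first trajectory changes direction (at the point (7, 8)); the other two are straight lines, so they contribute no rows. Trajectories with fewer than three points yield an empty theta and are skipped naturally.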

Solution 1
2  Accepted  2017-05-07 18:34:13
