How to apply a defined function to a pandas DataFrame

Question

I have the following function, which operates on 2D arrays. The angle function calculates the angle between consecutive vectors.

In the code below, it takes directions as its parameter, a 2D array with two columns: one of x values and one of y values.

directions was obtained by applying np.diff() to a 2D array of points.

import matplotlib.pyplot as plt
import numpy as np
import os
import rdp

def angle(dir):
    """
    Returns the angles between vectors.

    Parameters:
    dir is a 2D-array of shape (N,M) representing N vectors in M-dimensional space.

    The return value is a 1D-array of values of shape (N-1,), with each value between 0 and pi.

    0 implies the vectors point in the same direction
    pi/2 implies the vectors are orthogonal
    pi implies the vectors point in opposite directions
    """
    dir2 = dir[1:]
    dir1 = dir[:-1]
    return np.arccos((dir1*dir2).sum(axis=1)/(np.sqrt((dir1**2).sum(axis=1)*(dir2**2).sum(axis=1))))

tolerance = 70
min_angle = np.pi*0.22

filename = os.path.expanduser('~/tmp/bla.data')
points = np.genfromtxt(filename).T
print(len(points))
x, y = points.T

# Use the Ramer-Douglas-Peucker algorithm to simplify the path
# http://en.wikipedia.org/wiki/Ramer-Douglas-Peucker_algorithm
# Python implementation: https://github.com/sebleier/RDP/
simplified = np.array(rdp.rdp(points.tolist(), tolerance))

print(len(simplified))
sx, sy = simplified.T

# compute the direction vectors on the simplified curve
directions = np.diff(simplified, axis=0)
theta = angle(directions)

# Select the index of the points with the greatest theta
# Large theta is associated with greatest change in direction.
idx = np.where(theta>min_angle)[0]+1
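As a quick sanity check of what angle, np.diff and the idx offset do, here is a minimal sketch on a small made-up path (the path array is hypothetical, not the data from bla.data). The np.clip call is an addition that is not in the code above; it only guards against round-off pushing the cosine slightly outside [-1, 1].

```python
import numpy as np

def angle(directions):
    """Angle between each pair of consecutive row vectors, in [0, pi]."""
    d1, d2 = directions[:-1], directions[1:]
    cos = (d1 * d2).sum(axis=1) / np.sqrt((d1 ** 2).sum(axis=1) * (d2 ** 2).sum(axis=1))
    # Addition to the original: clip so round-off cannot produce NaN in arccos.
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Hypothetical path: two steps right, one step up, one step back down.
path = np.array([[0, 0], [1, 0], [2, 0], [2, 1], [2, 0]], dtype=float)

directions = np.diff(path, axis=0)  # one direction vector per segment, shape (4, 2)
theta = angle(directions)           # one angle per interior point: [0, pi/2, pi]

min_angle = np.pi * 0.22
# theta[k] is the turn at point k + 1, hence the +1 offset.
idx = np.where(theta > min_angle)[0] + 1
print(theta, idx)                   # idx is [2 3]
```

N points give N-1 direction vectors and N-2 angles, so the first and last point of a path never get an angle.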

I want to apply the above code to a pandas.DataFrame containing trajectory data.

Below is a sample df. The sx, sy values that share the same subid are treated as one trajectory: rows 0-3, with id 11 and subid 2, are the points of one trajectory, rows 4-6 are another, and so on. Whenever the subid or id changes, a new trajectory begins.

  id      subid     simplified_points     sx       sy
0 11      2         (3,4)                 3        4
1 11      2         (5,6)                 5        6
2 11      2         (7,8)                 7        8
3 11      2         (9,9)                 9        9
4 11      3         (10,12)               10       12
5 11      3         (12,14)               12       14
6 11      3         (13,15)               13       15
7 12      9         (18,20)               18       20
8 12      9         (22,24)               22       24
9 12      9         (25,27)               25       27

The data frame above was obtained after applying the rdp algorithm. The simplified_points column, further unpacked into the two columns sx and sy, is the result of rdp.

The problem is getting directions for each of these trajectories, and then theta and idx. Since the code above was written for a single trajectory stored as a 2D array, I am unable to apply it to the pandas data frame above.

Please suggest a way to apply the above code to each trajectory in a df.

Answer 1

You can use pandas.DataFrame.groupby.apply() to work on each (id, subid), with something like:

Code:

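import numpy as np
import pandas as pd

def theta(group):
    # Differences between consecutive points within one trajectory
    dx = pd.Series(group.sx.diff(), name='dx')
    dy = pd.Series(group.sy.diff(), name='dy')
    # Heading of each segment, measured from the x-axis
    theta = pd.Series(np.arctan2(dy, dx), name='theta')
    return pd.concat([dx, dy, theta], axis=1)

df2 = df.groupby(['id', 'subid']).apply(theta)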

Test Code:

from io import StringIO

df = pd.read_fwf(StringIO(u"""
    id      subid     simplified_points     sx       sy
    11      2         (3,4)                 3        4
    11      2         (5,6)                 5        6
    11      2         (7,8)                 7        8
    11      2         (9,9)                 9        9
    11      3         (10,12)               10       12
    11      3         (12,14)               12       14
    11      3         (13,15)               13       15
    12      9         (18,20)               18       20
    12      9         (22,24)               22       24
    12      9         (25,27)               25       27"""),
                 header=1)

df2 = df.groupby(['id', 'subid']).apply(theta)
df = pd.concat([df, pd.DataFrame(df2.values, columns=df2.columns)], axis=1)
print(df)

Results:

   id  subid simplified_points  sx  sy   dx   dy     theta
0  11      2             (3,4)   3   4  NaN  NaN       NaN
1  11      2             (5,6)   5   6  2.0  2.0  0.785398
2  11      2             (7,8)   7   8  2.0  2.0  0.785398
3  11      2             (9,9)   9   9  2.0  1.0  0.463648
4  11      3           (10,12)  10  12  NaN  NaN       NaN
5  11      3           (12,14)  12  14  2.0  2.0  0.785398
6  11      3           (13,15)  13  15  1.0  1.0  0.785398
7  12      9           (18,20)  18  20  NaN  NaN       NaN
8  12      9           (22,24)  22  24  4.0  4.0  0.785398
9  12      9           (25,27)  25  27  3.0  3.0  0.785398
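Note that np.arctan2(dy, dx) gives the heading of each segment relative to the x-axis, which is not the same quantity as the angle between consecutive direction vectors that the question's angle function returns. If the latter is what is needed, one possible sketch is to loop over the groups and reuse angle on each trajectory. The is_turn column name, the explicit loop and the np.clip guard are assumptions made for illustration, not part of the original question or answer.

```python
import numpy as np
import pandas as pd

def angle(directions):
    """Angle between each pair of consecutive row vectors, in [0, pi]."""
    d1, d2 = directions[:-1], directions[1:]
    cos = (d1 * d2).sum(axis=1) / np.sqrt((d1 ** 2).sum(axis=1) * (d2 ** 2).sum(axis=1))
    # Assumption: clip to guard against round-off outside [-1, 1].
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Sample data from the question (simplified_points omitted for brevity).
df = pd.DataFrame({
    'id':    [11, 11, 11, 11, 11, 11, 11, 12, 12, 12],
    'subid': [2, 2, 2, 2, 3, 3, 3, 9, 9, 9],
    'sx':    [3, 5, 7, 9, 10, 12, 13, 18, 22, 25],
    'sy':    [4, 6, 8, 9, 12, 14, 15, 20, 24, 27],
})

min_angle = np.pi * 0.22
theta = pd.Series(np.nan, index=df.index)

for _, group in df.groupby(['id', 'subid'], sort=False):
    points = group[['sx', 'sy']].to_numpy(dtype=float)
    if len(points) >= 3:  # at least two segments are needed for one angle
        directions = np.diff(points, axis=0)
        # Store each angle on the interior point where the turn happens.
        theta.loc[group.index[1:-1]] = angle(directions)

df['theta'] = theta
df['is_turn'] = df['theta'] > min_angle  # hypothetical column; NaN compares as False
print(df)
```

The first and last point of every trajectory keep NaN in theta, because no angle is defined there. The row labels where is_turn is True play the role of idx from the question, one set per trajectory.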

Answer 1: 2, accepted, 2017-05-07 18:34:13
