How to apply a defined function to a pandas DataFrame

Question

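I have defined the following function, which operates on a 2D array. The angle function computes the angles between consecutive vectors.

The function takes directions as its argument: a 2D array with two columns, one holding the x values and the other the y values.

directions is obtained by applying np.diff() to the 2D array of points.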

import matplotlib.pyplot as plt
import numpy as np
import os
import rdp

def angle(dir):
    """
    Returns the angles between vectors.

    Parameters:
    dir is a 2D-array of shape (N,M) representing N vectors in M-dimensional space.

    The return value is a 1D-array of values of shape (N-1,), with each value between 0 and pi.

    0 implies the vectors point in the same direction
    pi/2 implies the vectors are orthogonal
    pi implies the vectors point in opposite directions
    """
    dir2 = dir[1:]
    dir1 = dir[:-1]
    return np.arccos((dir1*dir2).sum(axis=1)/(np.sqrt((dir1**2).sum(axis=1)*(dir2**2).sum(axis=1))))

tolerance = 70
min_angle = np.pi*0.22

filename = os.path.expanduser('~/tmp/bla.data')
points = np.genfromtxt(filename).T
print(len(points))
x, y = points.T

# Use the Ramer-Douglas-Peucker algorithm to simplify the path
# http://en.wikipedia.org/wiki/Ramer-Douglas-Peucker_algorithm
# Python implementation: https://github.com/sebleier/RDP/
simplified = np.array(rdp.rdp(points.tolist(), tolerance))

print(len(simplified))
sx, sy = simplified.T

# compute the direction vectors on the simplified curve
directions = np.diff(simplified, axis=0)
theta = angle(directions)

# Select the index of the points with the greatest theta
# Large theta is associated with greatest change in direction.
idx = np.where(theta>min_angle)[0]+1
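
To make the shapes and the meaning of idx concrete, here is a minimal sketch on a hypothetical toy path (the path values are invented for illustration, and rdp is skipped). The np.clip call is an added safeguard against rounding, not part of the code above.

```python
import numpy as np

def angle(directions):
    """Angle in radians between each pair of consecutive direction vectors."""
    d1, d2 = directions[:-1], directions[1:]
    cos = (d1 * d2).sum(axis=1) / np.sqrt((d1 ** 2).sum(axis=1) * (d2 ** 2).sum(axis=1))
    # Assumption: clip so rounding cannot push the cosine outside [-1, 1].
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Hypothetical toy path: two steps east, one step north, one step west.
path = np.array([[0, 0], [1, 0], [2, 0], [2, 1], [1, 1]], dtype=float)

directions = np.diff(path, axis=0)   # one vector per step, shape (4, 2)
theta = angle(directions)            # one angle per interior point, shape (3,)

min_angle = np.pi * 0.22
# theta[i] belongs to point i + 1, hence the + 1 offset.
idx = np.where(theta > min_angle)[0] + 1

print(theta)
print(path[idx])
```

With N points there are N-1 direction vectors and N-2 angles, so the first and last point of a path never appear in idx.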

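I want to apply the code above to a pandas.DataFrame that holds trajectory data.

Below is a sample df. The sx, sy values that share the same subid are treated as one trajectory. For example, rows 0-3 have subid 2 and id 11, so they are points on one trajectory. Rows 4-6 form another trajectory, and so on. A new trajectory therefore starts whenever subid or id changes.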

  id      subid     simplified_points     sx       sy
0 11      2         (3,4)                 3        4
1 11      2         (5,6)                 5        6
2 11      2         (7,8)                 7        8
3 11      2         (9,9)                 9        9
4 11      3         (10,12)               10       12
5 11      3         (12,14)               12       14
6 11      3         (13,15)               13       15
7 12      9         (18,20)               18       20
8 12      9         (22,24)               22       24
9 12      9         (25,27)               25       27

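The DataFrame above was obtained after the rdp algorithm had already been applied. simplified_points, which is further unpacked into the two columns sx and sy, is the output of the rdp algorithm.

The problem is obtaining the directions for each trajectory, and then theta and idx. The code above handles only a single trajectory stored as a 2D array, so I have not been able to apply it to the DataFrame above.

Please suggest a way to apply the code above to each trajectory in df.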

Answer 1

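You can use pandas.DataFrame.groupby.apply() to process each (id, subid) group, for example:

Code:

import numpy as np
import pandas as pd

def theta(group):
    # Step between consecutive points of one trajectory (first row is NaN),
    # and the heading of that step measured from the x axis.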
    dx = pd.Series(group.sx.diff(), name='dx')
    dy = pd.Series(group.sy.diff(), name='dy')
    theta = pd.Series(np.arctan2(dy, dx), name='theta')
    return pd.concat([dx, dy, theta], axis=1)

df2 = df.groupby(['id', 'subid']).apply(theta)

Test code:

from io import StringIO

df = pd.read_fwf(StringIO(u"""
    id      subid     simplified_points     sx       sy
    11      2         (3,4)                 3        4
    11      2         (5,6)                 5        6
    11      2         (7,8)                 7        8
    11      2         (9,9)                 9        9
    11      3         (10,12)               10       12
    11      3         (12,14)               12       14
    11      3         (13,15)               13       15
    12      9         (18,20)               18       20
    12      9         (22,24)               22       24
    12      9         (25,27)               25       27"""),
                 header=1)

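df2 = df.groupby(['id', 'subid']).apply(theta)
# Attach by position: this relies on df already being sorted by (id, subid),
# so that the grouped rows come back in the same order as df.
df = pd.concat([df, pd.DataFrame(df2.values, columns=df2.columns)], axis=1)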
print(df)

Result:

   id  subid simplified_points  sx  sy   dx   dy     theta
0  11      2             (3,4)   3   4  NaN  NaN       NaN
1  11      2             (5,6)   5   6  2.0  2.0  0.785398
2  11      2             (7,8)   7   8  2.0  2.0  0.785398
3  11      2             (9,9)   9   9  2.0  1.0  0.463648
4  11      3           (10,12)  10  12  NaN  NaN       NaN
5  11      3           (12,14)  12  14  2.0  2.0  0.785398
6  11      3           (13,15)  13  15  1.0  1.0  0.785398
7  12      9           (18,20)  18  20  NaN  NaN       NaN
8  12      9           (22,24)  22  24  4.0  4.0  0.785398
9  12      9           (25,27)  25  27  3.0  3.0  0.785398
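
Note that theta in the result above is the heading of each step from np.arctan2, not the angle between consecutive direction vectors that the question's angle function returns. If the turning angle and the idx-style selection are what is needed, one possible sketch is below. The helper name turning_points and the is_turn column are hypothetical, the np.clip call is an added safeguard, and the groups are looped over explicitly rather than passed to apply, to keep index alignment simple.

```python
import numpy as np
import pandas as pd

def angle(directions):
    """Angle in radians between each pair of consecutive direction vectors."""
    d1, d2 = directions[:-1], directions[1:]
    cos = (d1 * d2).sum(axis=1) / np.sqrt((d1 ** 2).sum(axis=1) * (d2 ** 2).sum(axis=1))
    # Assumption: clip so rounding cannot push the cosine outside [-1, 1].
    return np.arccos(np.clip(cos, -1.0, 1.0))

def turning_points(group, min_angle=np.pi * 0.22):
    """Hypothetical helper: turning angle per point of one trajectory."""
    pts = group[['sx', 'sy']].to_numpy(dtype=float)
    theta = np.full(len(pts), np.nan)        # endpoints have no turning angle
    is_turn = np.zeros(len(pts), dtype=bool)
    if len(pts) >= 3:
        inner = angle(np.diff(pts, axis=0))  # one angle per interior point
        theta[1:-1] = inner
        is_turn[1:-1] = inner > min_angle
    # Keep the original row labels so the result can be joined back onto df.
    return pd.DataFrame({'theta': theta, 'is_turn': is_turn}, index=group.index)

# The sample data from the question (simplified_points omitted).
df = pd.DataFrame({
    'id':    [11] * 7 + [12] * 3,
    'subid': [2] * 4 + [3] * 3 + [9] * 3,
    'sx':    [3, 5, 7, 9, 10, 12, 13, 18, 22, 25],
    'sy':    [4, 6, 8, 9, 12, 14, 15, 20, 24, 27],
})

parts = [turning_points(g) for _, g in df.groupby(['id', 'subid'])]
result = df.join(pd.concat(parts))
print(result)
```

Rows where is_turn is True play the role of idx in the question. On this sample no row exceeds min_angle, since each trajectory is nearly straight.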
