Stretching arrays in a dataframe to the same size

Question

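I have a dataframe called allData that contains thousands of rows of participant data. Each row holds a single trial: the participant number (a single value) and a numpy array of x,y trajectory data (always 2 columns, but anywhere from 100 to 900 rows long). The dataframe allData looks like this: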

   participantNum   data
0         6432024  [[-16.0, 5.0], [-16.0, 5.0], [-15.83345, 5.0],...
1         1039607  [[-16.0, 5.0], [-16.0, 5.0], [-16.0, 5.0], [-1...
2         6950203  [[-16.0, 5.0], [-16.0, 5.0], [-16.0, 5.0], [-1...
3         8486566  [[-16.0, -5.0], [-16.0, -5.0], [-16.0, -5.0], ...
4         1315866  [[-16.0, 5.0], [-16.0, 5.0], [-16.0, 5.0], [-1...
5         8593676  [[-16.0, 5.0], [-16.0, 5.0], [-16.0, 5.0], [-1...
6         9526582  [[-16.0, 5.0], [-16.0, 5.0], [-16.0, 5.0], [-1...
7         6432024  [[-16.0, 5.0], [-16.0, 5.0], [-16.0, 5.0], [-1...
8         9719645  [[-16.0, -5.0], [-16.0, -5.0], [-16.0, -5.0], ...
9         7830381  [[-16.0, -5.0], [-16.0, -5.0], [-16.0, -5.0], ...

If I isolate the x,y data from one row of allData with XY = allData.iloc[1].data, it looks like this:

   [-16.        ,   5.        ],
   [-15.8315    ,   5.        ],
   [-15.6705    ,   5.        ],
   [-15.5039    ,   5.        ],
   [-15.3373    ,   5.        ],
   [-15.1691    ,   5.        ],
   [-14.8319    ,   5.        ],
   [-14.671     ,   5.        ],
   [-14.5054    ,   5.        ],
   [-14.33635   ,   5.        ],
   [-14.1707    ,   5.        ],
   [-13.8324    ,   5.        ],
   [-13.66605   ,   5.000121  ],
   [-13.50385   ,   5.000464  ],
   [-13.33785   ,   5.001173  ],
   [-13.1701    ,   5.002377  ],
   [-12.83478   ,   5.00674   ],

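I need to iterate over all rows of the dataframe allData and stretch the x,y arrays so that each one is 1000 rows long (I assume I need interpolation to do this?). The stretched/interpolated trajectory should look the same, just with more data points filling in the gaps.

I have tried using resampy and interp1d, but I am struggling to figure it out.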

Answer 1

This function takes each row and, as requested, interpolates each of its x and y columns to N=1000 points:

import pandas as pd
import numpy as np
from scipy.interpolate import interp1d

# gen dummy data:
_N = 20
data = []

for _ in range(_N):
    l = np.random.choice(np.arange(100, 900))
    xy = np.array([np.arange(l), np.arange(l)]).T + np.random.random(size=(l, 2))
    data.append([np.random.choice(np.arange(100000, 999999)), xy])
allData = pd.DataFrame(data, columns=["participant", "data"])

# function that does the interpolation
def gen_records(arr, N=1000):
    # interpolate arr over `N` evenly spaced points;
    # the sample position is normalized to [0, 1] so that a constant
    # column (min == max) does not produce a degenerate x axis
    t_orig = np.linspace(0, 1, len(arr))
    t_interp = np.linspace(0, 1, N)
    f = interp1d(x=t_orig, y=arr)
    interp_arr = f(t_interp)
    return interp_arr

# apply to dataframe    
allData["interp_data"] = allData.data.apply(
    lambda ser: np.array([gen_records(ser[:, 0]), gen_records(ser[:, 1])]).T
)
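As a quick sanity check on the idea above, the following sketch stretches a single trajectory and confirms that the shape and endpoints come out as expected. It swaps in `np.interp` as a NumPy-only alternative to `interp1d`; the helper name `stretch_xy` and the sample trajectory are made up for illustration and are not part of the original answer.

```python
import numpy as np

def stretch_xy(xy, n=1000):
    """Linearly resample an (m, 2) trajectory to n rows (hypothetical helper)."""
    xy = np.asarray(xy, dtype=float)
    t_orig = np.linspace(0.0, 1.0, len(xy))  # normalized position of each original sample
    t_new = np.linspace(0.0, 1.0, n)         # n evenly spaced positions on the same axis
    # Interpolate each column independently and stack them back into (n, 2)
    return np.column_stack(
        [np.interp(t_new, t_orig, xy[:, k]) for k in range(xy.shape[1])]
    )

# Made-up trajectory: x moves from -16 to 16 while y stays constant at 5
xy = np.column_stack([np.linspace(-16.0, 16.0, 137), np.full(137, 5.0)])
stretched = stretch_xy(xy)

print(stretched.shape)
print(stretched[0], stretched[-1])
```

Because the interpolation is linear along the sample index, the stretched path follows the original one; it only adds points between the existing samples.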

Notes:

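If the total time span differs between your samples (rows), the result can be misleading, because the time increment of interp_data will differ from row to row.
You are storing nested arrays in each row of the df, which is not how pandas/numpy are intended to be used. You would benefit from "flattening" the data so that each element is a single value.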
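To make the first note concrete, here is a small sketch of how the effective time step changes after stretching. It assumes a fixed sampling interval `dt` for the original recordings, which the question does not state, and the function name `stretched_step` is hypothetical.

```python
def stretched_step(n_orig, dt=0.01, n_out=1000):
    """Time represented by one step after stretching n_orig samples to n_out.

    dt is an assumed, hypothetical sampling interval in seconds.
    """
    duration = (n_orig - 1) * dt      # time spanned by the original trial
    return duration / (n_out - 1)     # time per step on the stretched axis

# Trials at the short, middle, and long end of the 100-900 row range
for n_orig in (100, 500, 900):
    print(n_orig, stretched_step(n_orig))
```

Under this assumption, row 500 of a stretched 100-sample trial and row 500 of a stretched 900-sample trial correspond to very different elapsed times, so comparing rows across trials compares relative progress through the trial rather than absolute time.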

For example, data in the following layout would be easier to work with:

df_x = allData.apply(lambda x: pd.Series(x.interp_data[:,0]), axis=1).T
col_dict = {idx: str(part) + "_x" for idx, part in zip(allData.index, allData.participant)}
df_x.rename(columns=col_dict, inplace=True)
df_y = allData.apply(lambda x: pd.Series(x.interp_data[:,1]), axis=1).T
col_dict = {idx: str(part) + "_y" for idx, part in zip(allData.index, allData.participant)}
df_y.rename(columns=col_dict, inplace=True)

df = pd.concat([df_x, df_y], axis=1)
df = df.sort_index(axis=1)

Preview:

       127877_x    127877_y    157700_x    157700_y    192204_x    192204_y  ...    743568_x    743568_y    805716_x    805716_y    805971_x    805971_y
0      0.090315    0.710476    0.992240    0.120537    0.552173    0.470253  ...    0.858416    0.206509    0.340299    0.182788    0.280935    0.998095
1      0.949895    1.661619    1.557746    1.009242    1.023637    1.072346  ...    0.977083    0.472867    0.630777    0.494193    1.417067    1.586117
2      2.344848    2.088494    2.053382    1.654844    1.495102    1.674438  ...    1.095750    0.739226    0.921255    0.805599    2.283156    2.184530
3      3.035018    2.977106    2.456115    1.977199    1.834009    2.194734  ...    1.214417    1.005584    1.259526    1.145122    2.891425    2.810716
4      3.650583    4.146497    3.280621    2.600666    2.108019    2.674983  ...    1.333084    1.271942    1.796585    1.601598    3.198264    3.511372
..          ...         ...         ...         ...         ...         ...  ...         ...         ...         ...         ...         ...         ...
995  844.979568  845.122972  634.268262  633.938912  427.674833  427.567986  ...  170.785379  170.618781  355.071745  355.069744  716.770732  716.554800
996  845.950312  845.685666  634.948427  634.520986  428.257304  428.351386  ...  171.069453  170.947405  355.400044  355.306856  717.315723  717.113295
997  846.831059  846.751367  635.483162  635.045098  428.803142  429.014094  ...  171.353527  171.276028  355.789927  355.801735  717.761773  717.753674
998  847.517356  848.121364  636.024521  635.743284  429.274156  429.430282  ...  171.637600  171.604652  356.194615  356.358587  718.339794  718.670332

The rows are time steps, and each column is a participant_x or participant_y series.
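With this layout, statistics across participants at each time step become short expressions. The sketch below uses a made-up wide frame with two hypothetical participant IDs and assumes the `_x`/`_y` column suffixes from the answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up wide frame: 1000 time steps, two hypothetical participants
df = pd.DataFrame(
    rng.random((1000, 4)),
    columns=["111111_x", "111111_y", "222222_x", "222222_y"],
)

x_cols = df.filter(regex=r"_x$")  # every participant's x series
y_cols = df.filter(regex=r"_y$")  # every participant's y series

# Mean trajectory across participants at each time step
mean_traj = pd.DataFrame({"x": x_cols.mean(axis=1), "y": y_cols.mean(axis=1)})
print(mean_traj.shape)
```

One thing to watch for: in the question's data, participant 6432024 appears in more than one row, so naming columns by participant number alone would likely produce duplicate column names. Including the row index or a trial number in the column name is one way around that.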

Solution 1
1 Accepted 2021-02-01 05:25:01
