Apply function to every two columns in dataframe and replace original columns with output

I have a dataframe that contains X & Y data in columns like this:

import numpy as np
import pandas as pd

df_cols = ['x1', 'y1', 'x2', 'y2', 'x3', 'y3']

np.random.seed(365)
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 6)), columns=df_cols)

   x1  y1  x2  y2  x3  y3
0   2   4   1   5   2   2
1   9   8   4   0   3   3
2   7   7   7   0   8   4
3   3   2   6   2   6   8
4   9   6   1   6   5   7
5   7   6   5   9   3   8
6   7   9   9   0   1   4
7   0   9   6   5   6   9
8   5   3   2   7   9   2
9   6   6   3   7   7   1

I need to call a function that takes one X & Y pair at a time and returns an updated X & Y pair (same length). I then want to either save that data to a new dataframe with the original column names, or replace the old X & Y data with the new data while keeping the original column names.

For example, take this function below:

def samplefunc(x, y):
    x = x*y
    y = x/10
    return x, y

# Apply function to each x & y pair 
x1, y1 = samplefunc(df.x1, df.y1)
x2, y2 = samplefunc(df.x2, df.y2)
x3, y3 = samplefunc(df.x3, df.y3)

# Save new/updated x & y pairs into new dataframe, preserving the original column names
df_updated = pd.DataFrame({'x1': x1, 'y1': y1, 'x2': x2, 'y2': y2, 'x3': x3, 'y3': y3})

# Desired result:
In [36]: df_updated
Out[36]: 
   x1   y1  x2   y2  x3   y3
0   8  0.8   5  0.5   4  0.4
1  72  7.2   0  0.0   9  0.9
2  49  4.9   0  0.0  32  3.2
3   6  0.6  12  1.2  48  4.8
4  54  5.4   6  0.6  35  3.5
5  42  4.2  45  4.5  24  2.4
6  63  6.3   0  0.0   4  0.4
7   0  0.0  30  3.0  54  5.4
8  15  1.5  14  1.4  18  1.8
9  36  3.6  21  2.1   7  0.7

But doing it this way is tedious, and impractical for a huge dataset. The similar/related questions I've found either perform a simple transformation on the data rather than calling a function, or add new columns to the dataframe instead of replacing the originals.

I tried to apply @PaulH's answer to my dataset, but neither of its two methods works for me, because it is unclear how to actually call the function inside either method.

# Method 1
array = np.array(my_actual_df)
df_cols = my_actual_df.columns
dist = 0.04 # a parameter I need for my function 
df = (
    pandas.DataFrame(array, columns=df_cols)
        .rename_axis(index='idx', columns='label')
        .stack()
        .to_frame('value')
        .reset_index()
        .assign(value=lambda df: numpy.select(
            [df['label'].str.startswith('x'), df['label'].str.startswith('y')],

            # Call the function (not working): 
            [df['value'], df['value']] = samplefunc(df['value'], df['value']),
        ))
        .pivot(index='idx', columns='label', values='value')
        .loc[:, df_cols]
)



# Method 2
df = (
    pandas.DataFrame(array, columns=df_cols)
        .pipe(lambda df: df.set_axis(df.columns.map(lambda c: (c[0], c[1])), axis='columns'))
        .rename_axis(columns=['which', 'group'])
        .stack(level='group')
         
        # Call the function (not working)
        .assign(df['x'], df['y'] = samplefunc(df['x'], df['y']))
        .unstack(level='group')
        .pipe(lambda df: df.set_axis([''.join(c) for c in df.columns], axis='columns'))
)

The actual function I need to call is from Arty's answer to this question: Resample trajectory to have equal euclidean distance in each sample

Use slicing and apply operations on those slices.

def samplefunc(x, y):
    x = x**2
    y = y/10
    return x, y

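# Use object dtype so integer and float columns can share one array
arr = df.to_numpy().astype(object)
e_col = arr[:, ::2]   # even positions: the x columns
o_col = arr[:, 1::2]  # odd positions: the y columns
e_col, o_col = samplefunc(e_col, o_col)
arr[:, ::2] = e_col
arr[:, 1::2] = o_col
out = pd.DataFrame(arr, columns=df.columns)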

   x1   y1  x2   y2  x3   y3
0   4  0.4   1  0.5   4  0.2
1  81  0.8  16  0.0   9  0.3
2  49  0.7  49  0.0  64  0.4
3   9  0.2  36  0.2  36  0.8
4  81  0.6   1  0.6  25  0.7
5  49  0.6  25  0.9   9  0.8
6  49  0.9  81  0.0   1  0.4
7   0  0.9  36  0.5  36  0.9
8  25  0.3   4  0.7  81  0.2
9  36  0.6   9  0.7  49  0.1
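The slicing above hands all x columns and all y columns to the function at once, which is fine for element-wise arithmetic. If the function has to see one pair at a time (for instance, one that processes a whole trajectory), a positional loop over the same slices is one possible sketch. It assumes the columns alternate x, y, x, y and that the function returns two arrays of the original length; the float dtype is also my assumption:

```python
import numpy as np
import pandas as pd

def samplefunc(x, y):
    # The function from the question: y is derived from the updated x
    x = x * y
    y = x / 10
    return x, y

np.random.seed(365)
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 6)),
                  columns=['x1', 'y1', 'x2', 'y2', 'x3', 'y3'])

# Work on a float copy so both results fit in one array
arr = df.to_numpy().astype(float)

# Step through the columns two at a time: (0, 1), (2, 3), ...
for j in range(0, arr.shape[1], 2):
    arr[:, j], arr[:, j + 1] = samplefunc(arr[:, j], arr[:, j + 1])

out = pd.DataFrame(arr, columns=df.columns, index=df.index)
```

Each call receives two 1-D arrays, so the function does not need to know about the dataframe or its column names.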

There are a couple of ways you could do this, depending on how your real-life dataframe is constructed.

The first thing that comes to my mind is to fully stack the dataframe and then use numpy.select to compute your new values based on the labels' values. You can then pivot the dataframe back to its original form:

import numpy
import pandas

df_cols = ['x1', 'y1', 'x2', 'y2', 'x3', 'y3']


numpy.random.seed(365)
array = numpy.random.randint(0, 10, size=(10, 6))
df = (
    pandas.DataFrame(array, columns=df_cols)
        .rename_axis(index='idx', columns='label')
        .stack()
        .to_frame('value')
        .reset_index()
        .assign(value=lambda df: numpy.select(
            [df['label'].str.startswith('x'), df['label'].str.startswith('y')],
            [df['value'] ** 2, df['value'] / 10],
        ))
        .pivot(index='idx', columns='label', values='value')
        .loc[:, df_cols]
)

label    x1   y1    x2   y2    x3   y3
idx                                   
0       4.0  0.4   1.0  0.5   4.0  0.2
1      81.0  0.8  16.0  0.0   9.0  0.3
2      49.0  0.7  49.0  0.0  64.0  0.4
3       9.0  0.2  36.0  0.2  36.0  0.8
4      81.0  0.6   1.0  0.6  25.0  0.7
5      49.0  0.6  25.0  0.9   9.0  0.8
6      49.0  0.9  81.0  0.0   1.0  0.4
7       0.0  0.9  36.0  0.5  36.0  0.9
8      25.0  0.3   4.0  0.7  81.0  0.2
9      36.0  0.6   9.0  0.7  49.0  0.1
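To make the numpy.select step easier to follow, here is a small sketch on a hand-built long-format frame. The name long_df and its four rows are made up for illustration; they mimic what the stacked dataframe looks like after reset_index:

```python
import numpy
import pandas

# Hypothetical long-format data: one row per (label, value)
long_df = pandas.DataFrame({
    'label': ['x1', 'y1', 'x2', 'y2'],
    'value': [2, 4, 1, 5],
})

# For each row, numpy.select takes the value from the first choice
# whose condition is True for that row.
new_value = numpy.select(
    [long_df['label'].str.startswith('x'), long_df['label'].str.startswith('y')],
    [long_df['value'] ** 2, long_df['value'] / 10],
)
```

Note that every choice is computed from the single value column. That suits element-wise transformations, but a function that needs x and y together does not fit this layout naturally, because each row holds only one of the two.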

Alternatively, you could treat your column names as a hierarchy, turn them into a multi-level index, and then stack only the second level of that index. That way, you end up with separate x and y columns that you can operate on directly and explicitly:

df = (
    pandas.DataFrame(array, columns=df_cols)
        # 'x1' -> ('x', '1'); c[1:] also handles suffixes such as 'x10'
        .pipe(lambda df: df.set_axis(df.columns.map(lambda c: (c[0], c[1:])), axis='columns'))
        .rename_axis(columns=['which', 'group'])
        .stack(level='group')
        .assign(x=lambda df: df['x'] ** 2, y=lambda df: df['y'] / 10)
        .unstack(level='group')
        .pipe(lambda df: df.set_axis([''.join(c) for c in df.columns], axis='columns'))
        # unstack groups the columns as x1, x2, x3, y1, ...; restore the original order
        .loc[:, df_cols]
)
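If the goal is to call a two-argument function such as the question's samplefunc(x, y) rather than write the arithmetic inline, the same hierarchy idea can be sketched without stacking. This is my own variation, not part of the answer above, and it assumes the function works element-wise on whole dataframes:

```python
import numpy
import pandas

def samplefunc(x, y):
    # The function from the question: y is derived from the updated x
    x = x * y
    y = x / 10
    return x, y

df_cols = ['x1', 'y1', 'x2', 'y2', 'x3', 'y3']
numpy.random.seed(365)
array = numpy.random.randint(0, 10, size=(10, 6))

wide = pandas.DataFrame(array, columns=df_cols)
# Split each name into (which, group): 'x1' -> ('x', '1')
wide.columns = pandas.MultiIndex.from_tuples(
    [(c[0], c[1:]) for c in wide.columns], names=['which', 'group']
)

# wide['x'] and wide['y'] are frames whose columns are the group labels,
# so pandas aligns x1 with y1, x2 with y2, and so on.
new_x, new_y = samplefunc(wide['x'], wide['y'])

# Put the 'x'/'y' level back, flatten the names, restore the original order
result = pandas.concat({'x': new_x, 'y': new_y}, axis='columns')
result.columns = [''.join(c) for c in result.columns]
result = result.loc[:, df_cols]
```

This calls the function once for all pairs, so it would not suit a function that must process each pair separately.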

New approach here:

  • split the column names into a multilevel index
  • do a horizontal groupby
  • modify your samplefunc to take a dataframe:
def samplefunc(df, xcol='x', ycol='y'):
    x = df[xcol].to_numpy()
    y = df[ycol].to_numpy()

    # Compute the new x first so that y is derived from it, as in the question
    new_x = x * y
    df[xcol] = new_x
    df[ycol] = new_x / 10
    return df

# Note: groupby(..., axis='columns') is deprecated as of pandas 2.1
df = (
    pandas.DataFrame(array, columns=df_cols)
        # 'x1' -> ('x', '1'); c[1:] also handles suffixes such as 'x10'
        .pipe(lambda df: df.set_axis(df.columns.map(lambda c: (c[0], c[1:])), axis='columns'))
        .rename_axis(columns=['which', 'group'])
        .groupby(level='group', axis='columns')
        .apply(samplefunc)
        .pipe(lambda df: df.set_axis([''.join(c) for c in df.columns], axis='columns'))
)

And I get:

   x1   y1  x2   y2  x3   y3
0   8  0.8   5  0.5   4  0.4
1  72  7.2   0  0.0   9  0.9
2  49  4.9   0  0.0  32  3.2
3   6  0.6  12  1.2  48  4.8
4  54  5.4   6  0.6  35  3.5
5  42  4.2  45  4.5  24  2.4
6  63  6.3   0  0.0   4  0.4
7   0  0.0  30  3.0  54  5.4
8  15  1.5  14  1.4  18  1.8
9  36  3.6  21  2.1   7  0.7
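On pandas versions where a column-wise groupby is unavailable or emits a deprecation warning, one possible substitute is to select each group explicitly with DataFrame.xs and call the original two-argument samplefunc on that pair. This is a sketch under the same naming assumption as above (a one-letter prefix followed by the group label); the names wide, pieces and result are mine:

```python
import numpy
import pandas

def samplefunc(x, y):
    # The function from the question: y is derived from the updated x
    x = x * y
    y = x / 10
    return x, y

df_cols = ['x1', 'y1', 'x2', 'y2', 'x3', 'y3']
numpy.random.seed(365)
array = numpy.random.randint(0, 10, size=(10, 6))

wide = pandas.DataFrame(array, columns=df_cols)
wide.columns = pandas.MultiIndex.from_tuples(
    [(c[0], c[1:]) for c in wide.columns], names=['which', 'group']
)

pieces = {}
for group in wide.columns.get_level_values('group').unique():
    # Select one pair; the remaining columns are simply 'x' and 'y'
    pair = wide.xs(group, axis='columns', level='group')
    new_x, new_y = samplefunc(pair['x'], pair['y'])
    pieces['x' + group] = new_x
    pieces['y' + group] = new_y

# Reassemble under the original flat names and column order
result = pandas.DataFrame(pieces, index=wide.index).loc[:, df_cols]
```

Because each call receives a single x series and a single y series, this form also accommodates functions that need a whole pair at once, as long as they return results of the same length.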
