
Pandas dataframe string formatting

I have a pandas DataFrame with multiple columns. My goal is to apply a complicated function to 3 columns and get a new column of values. Yet I want to apply the same function to different triplets of columns. Is there a way to use smart string formatting so I don't have to hardcode the column names 5 (or more) times?

Rough sketch: Columns('A1','A2','A3','B1','B2','B3',...)

def function(row):
    return row['A1']**2 + row['A2']**3 + row['A3']**4 ### String format here? 

Do the same for B1, B2, B3; C1, C2, C3, etc.

Thank you!

Using @Milo's setup DataFrame df:

np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5,9), columns=col_names)
print(df)

     A1    A2    A3    B1    B2    B3    C1    C2    C3
0  0.37  0.95  0.73  0.60  0.16  0.16  0.06  0.87  0.60
1  0.71  0.02  0.97  0.83  0.21  0.18  0.18  0.30  0.52
2  0.43  0.29  0.61  0.14  0.29  0.37  0.46  0.79  0.20
3  0.51  0.59  0.05  0.61  0.17  0.07  0.95  0.97  0.81
4  0.30  0.10  0.68  0.44  0.12  0.50  0.03  0.91  0.26

Then use groupby over the columns (axis=1), using the first letter of each column header as the grouping key.

df.pow(2).groupby(df.columns.str[0], axis=1).sum().pow(.5)

          A         B         C
0  1.256962  0.638019  1.055923
1  1.201048  0.878128  0.633695
2  0.803589  0.488905  0.929715
3  0.785843  0.634367  1.576812
4  0.755317  0.673667  0.946051
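Note that grouping along axis=1 has been deprecated and later removed in recent pandas releases, so here is a sketch of an equivalent that transposes instead (same seeded df as above); the intermediate variable name result is my own:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5, 9), columns=col_names)

# Transpose, group the (former column) labels by their first letter,
# sum the squares per group, transpose back, then take the square root.
result = df.pow(2).T.groupby(df.columns.str[0]).sum().T.pow(0.5)
print(result)
```

This produces the same A/B/C table as the axis=1 version and works on both old and new pandas.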

If I understand your question correctly, you want to name your columns according to a specific scheme like "A&lt;number&gt;" and then apply the same operation to them.

One way to do that is to filter for the naming scheme of the columns you want to address using regular expressions, and then use the apply method to apply your function.

Let's look at an example. I will first construct a DataFrame like so:

import pandas as pd
import numpy as np

np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5,9), columns=col_names)
print(df)

         A1        A2        A3        B1        B2        B3        C1  \
0  0.374540  0.950714  0.731994  0.598658  0.156019  0.155995  0.058084
1  0.708073  0.020584  0.969910  0.832443  0.212339  0.181825  0.183405
2  0.431945  0.291229  0.611853  0.139494  0.292145  0.366362  0.456070
3  0.514234  0.592415  0.046450  0.607545  0.170524  0.065052  0.948886
4  0.304614  0.097672  0.684233  0.440152  0.122038  0.495177  0.034389

         C2        C3
0  0.866176  0.601115
1  0.304242  0.524756
2  0.785176  0.199674
3  0.965632  0.808397
4  0.909320  0.258780

Then use the filter method in combination with regular expressions. As an example, I will square every value using a lambda, but you can use whatever function/operation you like:

print(df.filter(regex=r'A\d+').apply(lambda x: x*x))

         A1        A2        A3
0  0.140280  0.903858  0.535815
1  0.501367  0.000424  0.940725
2  0.186576  0.084814  0.374364
3  0.264437  0.350955  0.002158
4  0.092790  0.009540  0.468175

Edit (2017-07-10)

Taking the above examples, you can proceed with what you ultimately want to calculate. For example, we can compute the Euclidean distance across all A-columns as follows:

df.filter(regex=r'A\d+').apply(lambda x: x*x).sum(axis=1).apply(np.sqrt)

Which results in:

0    1.256962
1    1.201048
2    0.803589
3    0.785843
4    0.755317

So what we essentially computed is sqrt(A1^2 + A2^2 + A3^2 + ... + An^2) for every row.
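As an aside, the same per-row Euclidean norm can be written more compactly with np.linalg.norm; a sketch assuming the same seeded df as above (a_norm is my own name):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5, 9), columns=col_names)

# norm over axis=1 computes sqrt(A1^2 + A2^2 + A3^2) for each row.
a_norm = pd.Series(np.linalg.norm(df.filter(regex=r'A\d+'), axis=1),
                   index=df.index)
print(a_norm)
```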

But since you want to apply separate transformations to separate column naming schemes, you would have to hardcode the above method chain for each scheme.

A much more elegant solution is to use pipelines. Pipelines basically allow you to define operations on your DataFrame and then combine them the way you need. Again using the example of computing the Euclidean distance, we could construct a pipeline as follows:

def filter_columns(dataframe, regex):
    """Filter out columns of `dataframe` matched by `regex`."""
    return dataframe.filter(regex=regex)

def op_on_vals(dataframe, op_vals):
    """Apply `op_vals` to every value in the columns of `dataframe`"""
    return dataframe.apply(op_vals)

def op_across_columns(dataframe, op_cols):
    """Apply `op_cols` across the columns of `dataframe`"""

    # Catch the TypeError that would be raised if the function
    # were applied to a pandas.Series (which takes no `axis`).
    try:
        return dataframe.apply(op_cols, axis=1)
    except TypeError:
        return dataframe.apply(op_cols)

For every column naming scheme you can then define the transformations to apply and the order in which to apply them. This can be done, for example, by creating a dictionary that holds the column naming schemes as keys and the arguments for the pipes as values:

pipe_dict = {r'A\d+': [(op_on_vals, np.square), (op_across_columns, np.sum), (op_across_columns, np.sqrt)],
             r'B\d+': [(op_on_vals, np.square), (op_across_columns, np.mean)],
             r'C\d+': [(op_on_vals, lambda x: x**3), (op_across_columns, np.max)]}
# First pipe: Euclidean distance
# Second pipe: Mean of squares
# Third pipe: Maximum cube

df_list = []

for scheme in pipe_dict.keys():
    df_list.append(df.pipe(filter_columns, scheme))
    for (operation, func) in pipe_dict[scheme]:
        df_list[-1] = df_list[-1].pipe(operation, func)

print(df_list[0])

0    1.256962
1    1.201048
2    0.803589
3    0.785843
4    0.755317

Getting the same result as above.

Now, this is just an example use and neither very elegant nor computationally very efficient; it is just to demonstrate the concept of DataFrame pipelines. Taking these concepts, you can get really fancy with this, for example by defining pipelines of pipelines.

However, taking this example, you can achieve your goal of defining an arbitrary order of functions to be executed on your columns. You can now go one step further and apply one function at a time to specific columns, instead of applying functions across all columns.

For example, you can take my op_on_vals function and modify it so that it achieves what you outlined with row['A1']**2, row['A2']**3, and then use .pipe(op_across_columns, np.sum) to implement what you sketched with

def function(row):
    return row['A1']**2 + row['A2']**3 + row['A3']**4

This shouldn't be too difficult, so I will leave the details of this implementation to you.
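That said, here is one rough sketch of such a per-column-exponent step, assuming the exponents 2, 3, 4 from the question's example; op_on_cols and result are names I made up for this sketch:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5, 9), columns=col_names)

def op_on_cols(dataframe, exponents):
    # Raise each column to its own exponent; assumes `exponents`
    # is ordered like the filtered columns (A1, A2, A3).
    return dataframe.pow(exponents, axis=1)

# A1**2 + A2**3 + A3**4 for every row, in pipeline style.
result = (df.filter(regex=r'A\d+')
            .pipe(op_on_cols, [2, 3, 4])
            .sum(axis=1))
print(result)
```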


Edit (2017-07-11)

Here is another piece of code that uses functools.partial to create 'function prototypes' of a power function. These can be used to variably set the exponent of the power according to the number in the DataFrame's column names.

This way we can use the numbers in A1, A2, etc. to calculate value**1, value**2 for each value in the corresponding column. Finally, we can sum them to get what you sketched with

row['A1']**2 + row['A2']**3 + row['A3']**4

You can find an excellent explanation of what functools.partial does on PyDanny's Blog. Let's look at the code:

import pandas as pd
import numpy as np
import re

from functools import partial

def power(base, exponent):
    return base ** exponent

# Create example DataFrame.
np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5, 9), columns=col_names)

# Separate 'letter''number' strings of columns into tuples of (letter, number).
match = re.findall(r"([A-Z]+)([0-9]+)", ''.join(df.columns.tolist()))

# Dictionary with 'prototype' functions for each column naming scheme.
func_dict = {'A': power, 'B': power, 'C': power}

# Initialize result columns with zeros.
for letter, _ in match:
    df[letter+'_result'] = np.zeros_like(df[letter+'1'])

# Apply functions to columns
for letter, number in match:
    col_name = ''.join([letter, number])
    the_function = partial(func_dict[letter], exponent=int(number))
    df[letter+'_result'] += df[col_name].apply(the_function)

print(df)

Output:

         A1        A2        A3        B1        B2        B3        C1  \
0  0.374540  0.950714  0.731994  0.598658  0.156019  0.155995  0.058084
1  0.708073  0.020584  0.969910  0.832443  0.212339  0.181825  0.183405
2  0.431945  0.291229  0.611853  0.139494  0.292145  0.366362  0.456070
3  0.514234  0.592415  0.046450  0.607545  0.170524  0.065052  0.948886
4  0.304614  0.097672  0.684233  0.440152  0.122038  0.495177  0.034389

         C2        C3  A_result  B_result  C_result
0  0.866176  0.601115  1.670611  0.626796  1.025551
1  0.304242  0.524756  1.620915  0.883542  0.420470
2  0.785176  0.199674  0.745815  0.274016  1.080532
3  0.965632  0.808397  0.865290  0.636899  2.409623
4  0.909320  0.258780  0.634494  0.576463  0.878582

You can replace the power functions in func_dict with your own functions, for example one that sums the values with another value or performs some sort of fancy statistical calculation on them.
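For instance, a hypothetical replacement that adds the column number instead of exponentiating could be dropped in for the B scheme like so (plus_n is an invented name purely for illustration):

```python
import re
import numpy as np
import pandas as pd
from functools import partial

def power(base, exponent):
    return base ** exponent

def plus_n(value, exponent):
    # Hypothetical alternative: add the column number instead.
    return value + exponent

np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5, 9), columns=col_names)

match = re.findall(r"([A-Z]+)([0-9]+)", ''.join(df.columns.tolist()))

# Swap in the custom function for the 'B' naming scheme only.
func_dict = {'A': power, 'B': plus_n, 'C': power}

for letter, _ in match:
    df[letter + '_result'] = np.zeros_like(df[letter + '1'])

for letter, number in match:
    col_name = letter + number
    func = partial(func_dict[letter], exponent=int(number))
    df[letter + '_result'] += df[col_name].apply(func)

print(df[['A_result', 'B_result', 'C_result']])
```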

Using this in combination with the pipeline approach from my earlier edit should give you the tools to get the results you need.
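Coming back to the question's original "smart string formatting" idea, a minimal direct sketch is to parameterize the sketched function by the letter prefix and build the column names with string formatting (the exponents 2, 3, 4 follow the question's example; the `_new` column names are my own):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5, 9), columns=col_names)

def function(row, letter):
    # Build column names like 'A1', 'A2', 'A3' from the letter prefix.
    return (row['%s1' % letter] ** 2
            + row['%s2' % letter] ** 3
            + row['%s3' % letter] ** 4)

# One pass per naming scheme, no hardcoded column names in the function.
for letter in 'ABC':
    df[letter + '_new'] = df.apply(function, axis=1, letter=letter)
print(df[['A_new', 'B_new', 'C_new']])
```

Extra keyword arguments passed to DataFrame.apply (here letter=...) are forwarded to the applied function.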
