
Pass subset of df to function - Python

I'm manually passing specific values in a pandas df to a function. This works, but I'd like to make the process more efficient. Specifically, I first subset each run of consecutive values in Item. I then take the respective values in Val and pass them to func. This produces the value I need. It is fine for smaller df's but becomes inefficient for larger datasets.

I'm hoping to find a more efficient way to compute these values and apply them to the original df.

import pandas as pd
import numpy as np

df = pd.DataFrame({ 
            'Time' : ['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15'],                   
            'Val' : [35,38,31,30,35,31,32,34,36,38,39,30,25,26,27],                   
            'Item' : ['X','X','X','X','X','Y','Y','Y','Y','Y','Y','X','X','X','X'],  
                    })

df1 = df.groupby([df['Item'].ne(df['Item'].shift()).cumsum(), 'Item']).size()

X1 = df[0:5]
Y1 = df[5:11]
X2 = df[11:15]

V1 = X1['Val'].reset_index(drop = True)
V2 = Y1['Val'].reset_index(drop = True)
V3 = X2['Val'].reset_index(drop = True)

def func(U, m = 2, r = 0.2):

        def _maxdist(x_i, x_j):
            return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

        def _phi(m):
            x = [[U[j] for j in range(i, i + m - 1 + 1)] for i in range(N - m + 1)]
            C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
            return (N - m + 1.0)**(-1) * sum(np.log(C))

        N = len(U)

        return abs(_phi(m + 1) - _phi(m))

print(func(V1))
print(func(V2))
print(func(V3))

Out:

0.287682072452
0.223143551314
0.405465108108

If I just try to apply the function using groupby, it returns KeyError: 0. The function doesn't work unless I reset the index.

df1 = df.groupby(['Item']).apply(func)

KeyError: 0

Intended output:

   Time  Val Item   func
0     1   35    X  0.287
1     2   38    X  0.287
2     3   31    X  0.287
3     4   30    X  0.287
4     5   35    X  0.287
5     6   31    Y  0.223
6     7   32    Y  0.223
7     8   34    Y  0.223
8     9   36    Y  0.223
9    10   38    Y  0.223
10   11   39    Y  0.223
11   12   30    X  0.405
12   13   25    X  0.405
13   14   26    X  0.405
14   15   27    X  0.405

The issue is at U[j] in the _phi function. Its j is a positional index, but plain [] on a pandas object looks up by label, so the lookup fails as soon as the labels no longer start at 0. You can either use U.iloc[j] or convert U to a list and work directly on the list. Working on a list seems to be faster than using iloc, so my fix converts U to a list. The line x = ... in _phi can also be shortened by slicing.
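As a rough illustration of the label-versus-position point, here is a minimal sketch. It assumes U reaches func as a Series slice that keeps its original index labels (as happens with a groupby on the Val column); the variable names are only for this example.

```python
import pandas as pd

# Same 'Val' data as in the question; the default index is 0..14.
val = pd.Series([35, 38, 31, 30, 35, 31, 32, 34, 36, 38, 39, 30, 25, 26, 27])

# The second run (rows 5..10) keeps its original labels after slicing.
y1 = val.iloc[5:11]
print(list(y1.index))

# Plain [] is a label lookup here, and there is no label 0 in this slice.
try:
    y1[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)

# Position-based alternatives that do not depend on the labels.
first_by_iloc = y1.iloc[0]
first_by_list = y1.tolist()[0]
first_after_reset = y1.reset_index(drop=True)[0]
print(first_by_iloc, first_by_list, first_after_reset)
```

This is also why the manual version in the question only works after reset_index(drop = True): resetting makes the labels coincide with the positions again.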

Method 1:

def func(U, m = 2, r = 0.2):

    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        x = [vals[i:i + m] for i in range(N - m + 1)]  # changed: slice a plain list
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1.0)**(-1) * sum(np.log(C))

    N = len(U)
    vals = U.tolist()  # convert once; list indexing is positional

    return abs(_phi(m + 1) - _phi(m))

Create a custom group ID s as you did, then group by s and call transform:

s = df['Item'].ne(df['Item'].shift()).cumsum()
df['func'] = df.groupby(s).Val.transform(func)

Out[1090]:
   Time  Val Item      func
0     1   35    X  0.287682
1     2   38    X  0.287682
2     3   31    X  0.287682
3     4   30    X  0.287682
4     5   35    X  0.287682
5     6   31    Y  0.223144
6     7   32    Y  0.223144
7     8   34    Y  0.223144
8     9   36    Y  0.223144
9    10   38    Y  0.223144
10   11   39    Y  0.223144
11   12   30    X  0.405465
12   13   25    X  0.405465
13   14   26    X  0.405465
14   15   27    X  0.405465
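To make the grouping key less opaque, the sketch below breaks s into its two steps on the question's Item column. The intermediate names (changed, run_sizes) are assumptions introduced only for this illustration.

```python
import pandas as pd

item = pd.Series(['X', 'X', 'X', 'X', 'X',
                  'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
                  'X', 'X', 'X', 'X'])

# True wherever a row differs from the previous one; the first row is
# always True because it is compared against the NaN produced by shift().
changed = item.ne(item.shift())

# The running sum of those flags gives one integer label per consecutive run.
s = changed.cumsum()
print(s.tolist())

# transform returns one value per original row, so the result lines up
# with the original index and can be assigned straight back as a column.
run_sizes = item.groupby(s).transform('size')
print(run_sizes.tolist())
```

Because the second X run gets its own label, it is treated separately from the first X run, which is what the intended output requires.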

Method 2: It is shorter but less readable. Use as_strided from numpy.lib.stride_tricks.

def func(U, m = 2, r = 0.2):

    def _phi(m):
        strd = U.to_numpy().strides[0]
        x = as_strided(U.to_numpy(), (N-m+1, m), (strd, strd))
        C = (np.abs(x - x[:,None]).max(-1) <= r).sum(-1) / (N - m + 1.0)    
        return np.sum(np.log(C)) / (N - m + 1.0)

    N = len(U)

    return abs(_phi(m + 1) - _phi(m))      

You need to import as_strided, then create the group ID and call groupby transform as in Method 1:

from numpy.lib.stride_tricks import as_strided

s = df['Item'].ne(df['Item'].shift()).cumsum()
df['func'] = df.groupby(s).Val.transform(func)
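If a reasonably recent NumPy is available (1.20 or later, which is an assumption about your environment), sliding_window_view from the same module may be a safer way to build the windows, since as_strided does no bounds checking. This is a swapped-in alternative, not part of the method above; the sketch only checks that both produce the same windows for the first run of Val.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

# First run of 'Val' from the question.
u = np.array([35, 38, 31, 30, 35])
m = 2
n = len(u)

# Windows as built in Method 2.
strd = u.strides[0]
x_strided = as_strided(u, (n - m + 1, m), (strd, strd))

# Same windows via sliding_window_view, which validates the window size.
x_window = sliding_window_view(u, m)

print(x_window)
same_windows = np.array_equal(x_strided, x_window)
print(same_windows)
```

Both results are views on the original array, so neither copies the data.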

It seems that you are using apply with func as is, but func is not prepared to receive a whole slice of the dataframe directly. In such cases, lambda expressions are useful.

You could do as follows:

# First, convert each item (string) to a unique value (integer) (based on solution here: https://stackoverflow.com/questions/31701991/string-of-text-to-unique-integer-method)
df['ItemID'] = df['Item'].apply(lambda s: int.from_bytes(s.encode(), 'little'))

# Get the consecutive items (based on solution here: https://stackoverflow.com/questions/26911851/how-to-use-pandas-to-find-consecutive-same-data-in-time-series)
ItemConsecutive = (np.diff(df['ItemID'].values) != 0).astype(int).cumsum()
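# The first row always starts run 0 (inserting ItemConsecutive[0] would mislabel it if the second row already differs)
ItemConsecutive = np.insert(ItemConsecutive, 0, 0)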
df['ItemConsecutive'] = ItemConsecutive

# Define your custom func (unmodified)
def func(U, m = 2, r = 0.2):
    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])
    def _phi(m):
        x = [[U[j] for j in range(i, i + m - 1 + 1)] for i in range(N - m + 1)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1.0)**(-1) * sum(np.log(C))
    N = len(U)
    return abs(_phi(m + 1) - _phi(m))

# Get your calculated values with func based on each consecutive item
func_values = df.groupby('ItemConsecutive').apply(lambda x: func(x['Val'].reset_index(drop=True)))
func_values.name = 'func'

# Complete the dataframe with your calculated values
df = df.join(func_values, on='ItemConsecutive')

This is the result:

   Item Time  Val  ItemID  ItemConsecutive      func
0     X    1   35      88                0  0.287682
1     X    2   38      88                0  0.287682
2     X    3   31      88                0  0.287682
3     X    4   30      88                0  0.287682
4     X    5   35      88                0  0.287682
5     Y    6   31      89                1  0.223144
6     Y    7   32      89                1  0.223144
7     Y    8   34      89                1  0.223144
8     Y    9   36      89                1  0.223144
9     Y   10   38      89                1  0.223144
10    Y   11   39      89                1  0.223144
11    X   12   30      88                2  0.405465
12    X   13   25      88                2  0.405465
13    X   14   26      88                2  0.405465
14    X   15   27      88                2  0.405465

BTW, I'm using pandas version 0.23.3.
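The apply-then-join pattern above can be sketched in isolation as follows. To keep the example short, func is replaced by a hypothetical stand-in called spread (max minus min), and the run column is built with the shift/cumsum idiom rather than np.diff; both are assumptions made for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'Val': [35, 38, 31, 30, 35, 31, 32, 34, 36, 38, 39, 30, 25, 26, 27],
    'Item': list('XXXXXYYYYYYXXXX'),
})
df['run'] = df['Item'].ne(df['Item'].shift()).cumsum()

def spread(v):
    # Hypothetical stand-in for func: any function returning one scalar per group.
    return v.max() - v.min()

# apply yields one value per run, indexed by the run label ...
per_run = df.groupby('run')['Val'].apply(spread).rename('spread')

# ... and join(on=...) broadcasts that value back onto every row of the run.
joined = df.join(per_run, on='run')

# transform does the same broadcast in a single step.
transformed = df.groupby('run')['Val'].transform(spread)

print(joined)
```

Either route ends with one value repeated across each run; join is handy when the per-group result is also needed on its own.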

You need to use apply after the groupby: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

df1 = df.groupby(['Item']).apply( lambda x : myfunc(x) )

myfunc operates on the sub-dataframes produced by grouping on 'Item'.
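One thing worth checking before relying on this: grouping on 'Item' alone collects every row with the same label into a single sub-dataframe, regardless of where it sits in the frame. The sketch below compares the group sizes with those obtained from a consecutive-run key on the question's data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Val': [35, 38, 31, 30, 35, 31, 32, 34, 36, 38, 39, 30, 25, 26, 27],
    'Item': list('XXXXXYYYYYYXXXX'),
})

# Grouping by label: both X runs end up in the same group.
by_item = df.groupby('Item').size()
print(by_item.to_dict())

# Grouping by consecutive run: the two X runs stay separate.
run_id = df['Item'].ne(df['Item'].shift()).cumsum()
by_run = df.groupby(run_id).size()
print(by_run.tolist())
```

If the goal is one value per consecutive run, as in the intended output, the run-based key is the one to pass to groupby.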
