Pass subset of df to function - Python
I'm manually passing specific values from a pandas df to a function. This works, but I'm hoping to make the process more efficient. Specifically, I first subset each run of consecutive values in Item. I then take the respective values in Val and pass them to func. This produces the value I need. This is fine for smaller dfs but becomes inefficient for larger datasets.
I'm hoping to find a more efficient way to apply the function and assign the values back to the original df.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Time' : ['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15'],
    'Val' : [35,38,31,30,35,31,32,34,36,38,39,30,25,26,27],
    'Item' : ['X','X','X','X','X','Y','Y','Y','Y','Y','Y','X','X','X','X'],
    })

# Size of each run of consecutive identical Item values
df1 = df.groupby([df['Item'].ne(df['Item'].shift()).cumsum(), 'Item']).size()

# Manually slice out each run
X1 = df[0:5]
Y1 = df[5:11]
X2 = df[11:15]

# Reset the index so that func can index each Series from 0
V1 = X1['Val'].reset_index(drop = True)
V2 = Y1['Val'].reset_index(drop = True)
V3 = X2['Val'].reset_index(drop = True)

def func(U, m = 2, r = 0.2):

    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        x = [[U[j] for j in range(i, i + m - 1 + 1)] for i in range(N - m + 1)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1.0)**(-1) * sum(np.log(C))

    N = len(U)

    return abs(_phi(m + 1) - _phi(m))

print(func(V1))
print(func(V2))
print(func(V3))
Out:
0.287682072452
0.223143551314
0.405465108108
If I just try to apply the function using groupby, it returns KeyError: 0. The function doesn't work unless I reset the index.
df1 = df.groupby(['Item']).apply(func)
KeyError: 0
Intended output:
    Time  Val Item   func
0      1   35    X  0.287
1      2   38    X  0.287
2      3   31    X  0.287
3      4   30    X  0.287
4      5   35    X  0.287
5      6   31    Y  0.223
6      7   32    Y  0.223
7      8   34    Y  0.223
8      9   36    Y  0.223
9     10   38    Y  0.223
10    11   39    Y  0.223
11    12   30    X  0.405
12    13   25    X  0.405
13    14   26    X  0.405
14    15   27    X  0.405
The issue is at U[j] in the _phi function. Its j is a positional index, but U[j] on a pandas Series looks up the index label, which fails for any group whose index does not start at 0. You may use U.iloc[j], or convert U to a list and work directly on the list. Working on a list seems to be faster than using iloc, so my fix converts U to a list. The line x = ... in _phi can also be shortened by using a slice.
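To make the label versus position distinction concrete, here is a minimal sketch. The three-value Series and its index 5, 6, 7 are made up for illustration; they mimic a group taken from the middle of the df.

```python
import pandas as pd

# Hypothetical group whose index labels start at 5 rather than 0.
group = pd.Series([31, 32, 34], index=[5, 6, 7])

# group[0] is a label lookup; there is no label 0, so it raises KeyError.
try:
    group[0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Positional access works whatever the index labels are.
first_by_position = group.iloc[0]

# A plain list is always indexed by position.
first_from_list = group.tolist()[0]

print(label_lookup_failed, first_by_position, first_from_list)
```

This is why resetting the index made the original func work: after the reset, labels and positions coincide.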
Method 1:
def func(U, m = 2, r = 0.2):

    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        x = [U.tolist()[i:i + m] for i in range(N - m + 1)] #change at this line
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1.0)**(-1) * sum(np.log(C))

    N = len(U)

    return abs(_phi(m + 1) - _phi(m))
Create a custom group ID s as you did, then group by s and call transform:
s = df['Item'].ne(df['Item'].shift()).cumsum()
df['func'] = df.groupby(s).Val.transform(func)
Out[1090]:
    Time  Val Item      func
0      1   35    X  0.287682
1      2   38    X  0.287682
2      3   31    X  0.287682
3      4   30    X  0.287682
4      5   35    X  0.287682
5      6   31    Y  0.223144
6      7   32    Y  0.223144
7      8   34    Y  0.223144
8      9   36    Y  0.223144
9     10   38    Y  0.223144
10    11   39    Y  0.223144
11    12   30    X  0.405465
12    13   25    X  0.405465
13    14   26    X  0.405465
14    15   27    X  0.405465
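The reason transform fits here is that it repeats each group's single result across all rows of that group, so the output lines up with the original df. A small sketch with a made-up reducer (spread) and made-up data, just to show the shape of the result:

```python
import pandas as pd

# Toy data: two groups of two rows each (names and values are illustrative).
df = pd.DataFrame({"Run": [1, 1, 2, 2], "Val": [1, 4, 10, 20]})

def spread(values):
    """Reduce one group of values to a single number."""
    return values.max() - values.min()

grouped = df.groupby("Run")["Val"]
per_group = grouped.agg(spread)        # one value per group, indexed by Run
per_row = grouped.transform(spread)    # group value repeated for every row

df["spread"] = per_row                 # aligns with df's original index
print(per_group.tolist(), per_row.tolist())
```

With agg you would still need a join or map to get the values back onto the rows; transform does that step for you.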
Method 2: It is shorter but less readable. Use as_strided from numpy.lib.stride_tricks.
def func(U, m = 2, r = 0.2):

    def _phi(m):
        strd = U.to_numpy().strides[0]
        x = as_strided(U.to_numpy(), (N-m+1, m), (strd, strd))
        C = (np.abs(x - x[:,None]).max(-1) <= r).sum(-1) / (N - m + 1.0)
        return np.sum(np.log(C)) / (N - m + 1.0)

    N = len(U)

    return abs(_phi(m + 1) - _phi(m))
You need to import as_strided, then create the group ID and call groupby transform as in Method 1:
from numpy.lib.stride_tricks import as_strided
s = df['Item'].ne(df['Item'].shift()).cumsum()
df['func'] = df.groupby(s).Val.transform(func)
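As a side note, as_strided does no bounds checking, so a wrong shape or stride can silently read unrelated memory. If your NumPy is 1.20 or newer (an assumption about your environment, not something from the answer above), sliding_window_view from the same module builds the same windows with checks. A small sketch comparing the two on the first five Val values:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

values = np.array([35, 38, 31, 30, 35])
m = 2  # window length

# Manual strides, as in Method 2.
step = values.strides[0]
manual = as_strided(values, shape=(len(values) - m + 1, m), strides=(step, step))

# Same overlapping windows, with shape and bounds validated by NumPy.
checked = sliding_window_view(values, m)

print(checked)
```

Inside _phi this would replace the two as_strided lines with x = sliding_window_view(U.to_numpy(), m); I have not timed it against as_strided.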
It seems that you are using apply with func as is, but func is not prepared to receive a whole slice of the dataframe directly. In such cases, lambda expressions are useful.
You could do as follows:
# First, convert each item (string) to a unique value (integer) (based on solution here: https://stackoverflow.com/questions/31701991/string-of-text-to-unique-integer-method)
df['ItemID'] = df['Item'].apply(lambda s: int.from_bytes(s.encode(), 'little'))

# Get the consecutive items (based on solution here: https://stackoverflow.com/questions/26911851/how-to-use-pandas-to-find-consecutive-same-data-in-time-series)
ItemConsecutive = (np.diff(df['ItemID'].values) != 0).astype(int).cumsum()
# np.diff drops the first row; it always belongs to run 0
ItemConsecutive = np.insert(ItemConsecutive, 0, 0)
df['ItemConsecutive'] = ItemConsecutive

# Define your custom func (unmodified)
def func(U, m = 2, r = 0.2):

    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        x = [[U[j] for j in range(i, i + m - 1 + 1)] for i in range(N - m + 1)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1.0)**(-1) * sum(np.log(C))

    N = len(U)

    return abs(_phi(m + 1) - _phi(m))

# Get your calculated values with func for each run of consecutive items
func_values = df.groupby('ItemConsecutive').apply(lambda x: func(x['Val'].reset_index(drop=True)))
func_values.name = 'func'

# Complete the dataframe with your calculated values
df = df.join(func_values, on='ItemConsecutive')
This is the result:
    Item  Time  Val  ItemID  ItemConsecutive      func
0      X     1   35      88                0  0.287682
1      X     2   38      88                0  0.287682
2      X     3   31      88                0  0.287682
3      X     4   30      88                0  0.287682
4      X     5   35      88                0  0.287682
5      Y     6   31      89                1  0.223144
6      Y     7   32      89                1  0.223144
7      Y     8   34      89                1  0.223144
8      Y     9   36      89                1  0.223144
9      Y    10   38      89                1  0.223144
10     Y    11   39      89                1  0.223144
11     X    12   30      88                2  0.405465
12     X    13   25      88                2  0.405465
13     X    14   26      88                2  0.405465
14     X    15   27      88                2  0.405465
BTW, I'm using pandas version 0.23.3.
One needs to use apply after the groupby: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
df1 = df.groupby(['Item']).apply( lambda x : myfunc(x) )
myfunc operates on the sub-dataframes, which are grouped by 'Item'.
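One caveat worth checking before grouping on 'Item' alone: it puts both runs of X into a single group, whereas the question wants each consecutive run handled separately. A quick sketch on the question's data comparing group sizes (run_id is the same ne/shift/cumsum key used in the question):

```python
import pandas as pd

df = pd.DataFrame({
    "Val": [35, 38, 31, 30, 35, 31, 32, 34, 36, 38, 39, 30, 25, 26, 27],
    "Item": list("XXXXXYYYYYYXXXX"),
})

# Grouping by the label merges the two separate runs of X.
by_item = df.groupby("Item").size()

# Grouping by a run ID keeps each consecutive block apart.
run_id = df["Item"].ne(df["Item"].shift()).cumsum()
by_run = df.groupby(run_id).size()

print(by_item.to_dict())
print(by_run.tolist())
```

So for the intended output you would group on the run ID (or on both the run ID and 'Item') rather than on 'Item' by itself.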