Apply a function pairwise on a pandas series
I have a pandas Series whose elements are frozensets:
import pandas as pd

data = {0: frozenset({'apple', 'banana'}),
1: frozenset({'apple', 'orange'}),
2: frozenset({'banana'}),
3: frozenset({'kumquat', 'orange'}),
4: frozenset({'orange'}),
5: frozenset({'orange', 'pear'}),
6: frozenset({'orange', 'pear'}),
7: frozenset({'apple', 'banana', 'pear'}),
8: frozenset({'banana', 'persimmon'}),
9: frozenset({'apple'}),
10: frozenset({'banana'}),
11: frozenset({'apple'})}
tokens = pd.Series(data); tokens
0 (apple, banana)
1 (orange, apple)
2 (banana)
3 (orange, kumquat)
4 (orange)
5 (orange, pear)
6 (orange, pear)
7 (apple, banana, pear)
8 (persimmon, banana)
9 (apple)
10 (banana)
11 (apple)
Name: Tokens, dtype: object
I want to apply a function pairwise. For example, tokens.diff()
gives me the set difference between consecutive rows:
0 NaN
1 (orange)
2 (banana)
3 (orange, kumquat)
4 ()
5 (pear)
6 ()
7 (apple, banana)
8 (persimmon)
9 (apple)
10 (banana)
11 (apple)
Name: Tokens, dtype: object
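For object-dtype data like this, `diff` effectively applies the `-` operator between each element and the one before it, which for frozensets is set difference. A minimal sketch of that pairing written out explicitly follows; the names `rows`, `prev`, and `manual_diff` are illustrative and not from the question, and the data is rebuilt compactly from the example above.

```python
import numpy as np
import pandas as pd

# Rebuild the question's 12 rows compactly (one space-separated string per row).
rows = ["apple banana", "apple orange", "banana", "kumquat orange",
        "orange", "orange pear", "orange pear", "apple banana pear",
        "banana persimmon", "apple", "banana", "apple"]
tokens = pd.Series([frozenset(r.split()) for r in rows])

# Pair every row with the row before it; the first row has no predecessor,
# so shift(1) leaves NaN there.
prev = tokens.shift(1)

# Set difference where a predecessor exists, NaN otherwise.
manual_diff = pd.Series(
    [cur - p if isinstance(p, frozenset) else np.nan
     for cur, p in zip(tokens, prev)],
    index=tokens.index,
)
print(manual_diff)
```

Swapping `-` for another binary operation is what the rest of the question is about.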
I want the same thing, but with the set union of consecutive rows instead of the set difference. So ideally I would like:
0 NaN
1 (orange, apple, banana)
2 (banana, orange, apple)
3 (orange, kumquat, banana)
4 (orange, kumquat)
...
How can I achieve this with pandas? I know I can do it with zip
and a list comprehension, but I'm hoping there is a better way.
Here are a few approaches.
Option 1] List comprehension
In [3631]: pd.Series([x[0].union(x[1])
for x in zip(tokens, tokens.shift(-1).fillna(''))],
index=tokens.index)
Out[3631]:
0 (orange, banana, apple)
1 (orange, apple, banana)
2 (orange, kumquat, banana)
3 (orange, kumquat)
4 (orange, pear)
5 (orange, pear)
6 (orange, pear, banana, apple)
7 (persimmon, pear, banana, apple)
8 (apple, persimmon, banana)
9 (apple, banana)
10 (banana, apple)
11 (apple)
dtype: object
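The `fillna('')` in the snippet above works because `frozenset.union` accepts any iterable, and an empty string contributes no elements, so the last row is simply unioned with nothing. A sketch of the same next-row pairing without the placeholder, using `itertools.zip_longest` with an empty frozenset as the fill value, might look like this (the names `rows`, `values`, and `union_with_next` are illustrative):

```python
from itertools import zip_longest

import pandas as pd

# Rebuild the question's 12 rows compactly (one space-separated string per row).
rows = ["apple banana", "apple orange", "banana", "kumquat orange",
        "orange", "orange pear", "orange pear", "apple banana pear",
        "banana persimmon", "apple", "banana", "apple"]
tokens = pd.Series([frozenset(r.split()) for r in rows])

values = tokens.tolist()

# Pair each row with the next one; the last row is paired with an empty set.
union_with_next = pd.Series(
    [cur | nxt
     for cur, nxt in zip_longest(values, values[1:], fillvalue=frozenset())],
    index=tokens.index,
)
print(union_with_next)
```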
Option 2] map
In [3632]: pd.Series(map(lambda x: x[0].union(x[1]),
zip(tokens, tokens.shift(-1).fillna(''))),
index=tokens.index)
Out[3632]:
0 (orange, banana, apple)
1 (orange, apple, banana)
2 (orange, kumquat, banana)
3 (orange, kumquat)
4 (orange, pear)
5 (orange, pear)
6 (orange, pear, banana, apple)
7 (persimmon, pear, banana, apple)
8 (apple, persimmon, banana)
9 (apple, banana)
10 (banana, apple)
11 (apple)
dtype: object
Option 3] Using concat and apply
In [3633]: pd.concat([tokens, tokens.shift(-1).fillna('')],
axis=1).apply(lambda x: x[0].union(x[1]), axis=1)
Out[3633]:
0 (orange, banana, apple)
1 (orange, apple, banana)
2 (orange, kumquat, banana)
3 (orange, kumquat)
4 (orange, pear)
5 (orange, pear)
6 (orange, pear, banana, apple)
7 (persimmon, pear, banana, apple)
8 (apple, persimmon, banana)
9 (apple, banana)
10 (banana, apple)
11 (apple)
dtype: object
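All three options use `shift(-1)`, so each row is combined with the row after it and the last row is left unchanged. The output sketched in the question is aligned like `diff` instead: each row is combined with the row before it, and the first position is `NaN`. A minimal general helper under that reading is sketched below; `pairwise` and `union_with_prev` are hypothetical names, not pandas API.

```python
import operator

import numpy as np
import pandas as pd


def pairwise(series, func):
    """Apply func(current, previous) to consecutive rows; first row gets NaN."""
    values = series.tolist()
    if not values:
        return series.copy()
    # zip(values, values[1:]) yields (previous, current) pairs.
    out = [np.nan] + [func(cur, prev) for prev, cur in zip(values, values[1:])]
    return pd.Series(out, index=series.index)


# Rebuild the question's 12 rows compactly (one space-separated string per row).
rows = ["apple banana", "apple orange", "banana", "kumquat orange",
        "orange", "orange pear", "orange pear", "apple banana pear",
        "banana persimmon", "apple", "banana", "apple"]
tokens = pd.Series([frozenset(r.split()) for r in rows])

# operator.or_ is set union for frozensets; operator.sub would mimic diff.
union_with_prev = pairwise(tokens, operator.or_)
print(union_with_prev)
```

Passing a different two-argument function (for example `operator.and_` for intersection) reuses the same pairing.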
Timings
In [3647]: tokens.shape
Out[3647]: (60000L,)
In [3648]: %timeit pd.Series([x[0].union(x[1]) for x in zip(tokens, tokens.shift(-1).fillna(''))], index=tokens.index)
10 loops, best of 3: 35 ms per loop
In [3649]: %timeit pd.Series(map(lambda x: x[0].union(x[1]), zip(tokens, tokens.shift(-1).fillna(''))), index=tokens.index)
10 loops, best of 3: 40.9 ms per loop
In [3650]: %timeit pd.concat([tokens, tokens.shift(-1).fillna('')], axis=1).apply(lambda x: x[0].union(x[1]), axis=1)
1 loop, best of 3: 2.2 s per loop
Not directly related, but for comparison, diff:
In [3653]: %timeit tokens.diff()
10 loops, best of 3: 10.8 ms per loop
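The timings above come from an IPython session on a 60,000-row series whose construction is not shown. A rough way to rerun the comparison outside IPython is sketched below. It assumes the series is built by tiling the 12 example rows, which is a guess, and the function names and `N_TILES` are illustrative. Absolute numbers will vary with the machine and the pandas version.

```python
import timeit

import pandas as pd

rows = ["apple banana", "apple orange", "banana", "kumquat orange",
        "orange", "orange pear", "orange pear", "apple banana pear",
        "banana persimmon", "apple", "banana", "apple"]
base = [frozenset(r.split()) for r in rows]

# Assumption: tile the example rows. 1000 tiles gives 12,000 rows and keeps
# the run short; use 5000 to match the 60,000 rows in the session above.
N_TILES = 1000
tokens = pd.Series(base * N_TILES)


def with_listcomp(s):
    return pd.Series([a.union(b) for a, b in zip(s, s.shift(-1).fillna(''))],
                     index=s.index)


def with_map(s):
    pairs = zip(s, s.shift(-1).fillna(''))
    return pd.Series(list(map(lambda x: x[0].union(x[1]), pairs)),
                     index=s.index)


def with_concat_apply(s):
    frame = pd.concat([s, s.shift(-1).fillna('')], axis=1)
    return frame.apply(lambda x: x[0].union(x[1]), axis=1)


for fn in (with_listcomp, with_map, with_concat_apply):
    # Best of three single runs, reported in milliseconds.
    seconds = min(timeit.repeat(lambda: fn(tokens), number=1, repeat=3))
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms")
```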