
Concatenate multiple pandas series efficiently

I understand that I can use combine_first to merge two series:

import pandas as pd

series1 = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
series2 = pd.Series([1,2,3,4,5],index=['f','g','h','i','j'])
series3 = pd.Series([1,2,3,4,5],index=['k','l','m','n','o'])

Combine1 = series1.combine_first(series2)
print(Combine1)

Output:

a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
f    1.0
g    2.0
h    3.0
i    4.0
j    5.0
dtype: float64

What if I need to merge 3 or more series?

I understand that print(series1 + series2 + series3) yields:

a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
f   NaN
...
dtype: float64

Can I merge multiple series efficiently without calling combine_first multiple times?

Thanks

Combine Series with Non-Overlapping Indexes

To combine series vertically, use pd.concat.

import pandas as pd

# Setup
series_list = [
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('fghij')),
    pd.Series(range(1, 6), index=list('klmno'))
]

pd.concat(series_list)

a    1
b    2
c    3
d    4
e    5
f    1
g    2
h    3
i    4
j    5
k    1
l    2
m    3
n    4
o    5
dtype: int64
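
If you want pd.concat to fail loudly when the indexes turn out to overlap after all, its verify_integrity parameter can act as a guard. The following is a minimal sketch; the extra series that reuses the label 'a' is made up for illustration and is not part of the question.

```python
import pandas as pd

series_list = [
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('fghij')),
    pd.Series(range(1, 6), index=list('klmno')),
]

# Succeeds because no label appears in more than one series.
combined = pd.concat(series_list, verify_integrity=True)

# Hypothetical extra series that reuses the label 'a'.
overlapping = series_list + [pd.Series([99], index=['a'])]
try:
    pd.concat(overlapping, verify_integrity=True)
    raised = False
except ValueError:
    # pandas reports the overlapping labels in the error message.
    raised = True
```

Without verify_integrity=True, the second call would simply return a series in which 'a' appears twice.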

Combine with Overlapping Indexes

series_list = [
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('kbmdf'))
]

If the Series have overlapping indices, you can either combine (add) the values that share a key:

pd.concat(series_list, axis=1, sort=False).sum(axis=1)

a     2.0
b     6.0
c     6.0
d    12.0
e    10.0
k     1.0
m     3.0
f     5.0
dtype: float64
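
To see where those totals come from, it may help to look at the intermediate frame. This sketch reuses the series_list above; the min_count variant at the end is my own addition to show how to keep a total only where every series contributed a value.

```python
import pandas as pd

series_list = [
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('kbmdf')),
]

# One column per input series; labels missing from a series become NaN.
frame = pd.concat(series_list, axis=1, sort=False)

# sum skips NaN by default (skipna=True), so 'a' is 1 + 1 and 'k' is just 1.
totals = frame.sum(axis=1)

# Assumption for illustration: require all three values to be present.
# With min_count=3, rows with fewer than three non-NaN values give NaN.
complete_only = frame.sum(axis=1, min_count=3)
```

Here only 'b' and 'd' appear in all three series, so they are the only labels that keep a total in complete_only.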

Alternatively, drop the duplicate index entries if you want to keep only the first or last value for each duplicated label.

res = pd.concat(series_list, axis=0)
# keep first value
res[~res.index.duplicated(keep='first')]
# keep last value
res[~res.index.duplicated(keep='last')]
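
In the series_list above, the overlapping labels happen to carry the same value in every series, so keep='first' and keep='last' select the same numbers. The sketch below uses different, illustrative values (chosen so that the overlapping labels disagree) to make the difference visible.

```python
import pandas as pd

# Illustrative data: 'a', 'b' and 'c' appear twice with different values.
series_list = [
    pd.Series(range(5), index=list('abcde')),
    pd.Series(range(5, 10), index=list('aghij')),
    pd.Series(range(10, 15), index=list('kbmco')),
]
res = pd.concat(series_list, axis=0)

# duplicated() flags every occurrence of a label except the one to keep.
first = res[~res.index.duplicated(keep='first')]
last = res[~res.index.duplicated(keep='last')]

print(first[['a', 'b', 'c']].tolist())  # values from the earliest series
print(last[['a', 'b', 'c']].tolist())   # values from the latest series
```

Both results have a unique index; they differ only in which series wins for a duplicated label.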

Presuming that you were relying on combine_first to prioritize the values of the series in order, as combine_first is meant to do, you could succinctly chain multiple calls to it with reduce and a lambda expression.

from functools import reduce
l_series = [series1, series2, series3]
reduce(lambda s1, s2: s1.combine_first(s2), l_series)

Of course, if the indices are unique, as in your current example, you can simply use pd.concat instead.

Demo

series1 = pd.Series(list(range(5)),index=['a','b','c','d','e'])
series2 = pd.Series(list(range(5, 10)),index=['a','g','h','i','j'])
series3 = pd.Series(list(range(10, 15)),index=['k','b','m','c','o'])

from functools import reduce
l_series = [series1, series2, series3]
print(reduce(lambda s1, s2: s1.combine_first(s2), l_series))

# a     0.0
# b     1.0
# c     2.0
# d     3.0
# e     4.0
# g     6.0
# h     7.0
# i     8.0
# j     9.0
# k    10.0
# m    12.0
# o    14.0
# dtype: float64
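
As a possible alternative to chaining combine_first, the same "first non-null value per label" rule can be sketched with a single concat followed by a groupby on the index. This is my own variation, not something the approach above requires, and the two results may differ in dtype (the groupby version can stay integer while combine_first yields float64 here).

```python
from functools import reduce

import pandas as pd

series1 = pd.Series(range(5), index=list('abcde'))
series2 = pd.Series(range(5, 10), index=list('aghij'))
series3 = pd.Series(range(10, 15), index=list('kbmco'))
l_series = [series1, series2, series3]

# Chained combine_first: earlier series take priority over later ones.
chained = reduce(lambda s1, s2: s1.combine_first(s2), l_series)

# Stack everything, then keep the first non-null value seen for each label.
grouped = pd.concat(l_series).groupby(level=0).first()

# Compare values only, ignoring the dtype difference.
same_values = (
    grouped.sort_index().astype(float).tolist()
    == chained.sort_index().astype(float).tolist()
)
```

For a long list of series this avoids building an intermediate result for every pair, although I have not benchmarked it.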

I agree with what @codespeed pointed out in his answer.

I think it depends on the user's needs. If the series indexes are confirmed not to overlap, concat is the better option. (In the original question there is no index overlap, so concat is the better option there.)

If the indexes do overlap, you need to decide how to handle the overlap and which value should win. (As in the example provided by codespeed, if the same index label maps to different values, be careful with combine_first.)

For example (note that series3 is the same as series1, and series4 is the same as series2):

import pandas as pd
import numpy as np


series1 = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
series2 = pd.Series([2,3,4,4,5],index=['a','b','c','i','j'])
series3 = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
series4 = pd.Series([2,3,4,4,5],index=['a','b','c','i','j'])


print(series1.combine_first(series2))



a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
i    4.0
j    5.0
dtype: float64



print(series4.combine_first(series3))



a    2.0
b    3.0
c    4.0
d    4.0
e    5.0
i    4.0
j    5.0
dtype: float64
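
Before choosing which series should take priority, it can be useful to list the labels on which two series actually disagree. The following is a rough sketch using series1 and series2 from above; the names shared, both and conflicts are mine.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
series2 = pd.Series([2, 3, 4, 4, 5], index=['a', 'b', 'c', 'i', 'j'])

# Labels present in both series.
shared = series1.index.intersection(series2.index)

# Side-by-side view of the shared labels, restricted to rows that disagree.
both = pd.DataFrame({
    'series1': series1.loc[shared],
    'series2': series2.loc[shared],
})
conflicts = both[both['series1'] != both['series2']]
print(conflicts)
```

Every row in conflicts is a label where the result of combine_first depends on which series is the caller.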

You would use combine_first if you want one series's values prioritized over the other. It's usually used to fill the missing values in the first series. I am not sure what the expected output in your example is, but it looks like you can use concat:

pd.concat([series1, series2, series3])

You get

a    1
b    2
c    3
d    4
e    5
f    1
g    2
h    3
i    4
j    5
k    1
l    2
m    3
n    4
o    5
