Missing rows when adding a series as a new column to a pandas dataframe

I have two series, s1 and s2, defined as follows:

import pandas as pd

s1 = pd.Series({1: 10, 2: 20, 3: 30, 4: 40}, name='s1')
s2 = pd.Series({3: 35, 4: 45, 5: 55, 6: 65}, name='s2')

They look like this:

1    10
2    20
3    30
4    40
Name: s1, dtype: int64

3    35
4    45
5    55
6    65
Name: s2, dtype: int64

I am trying to create a dataframe that has s1 and s2 as its two columns, with an index that is the union of the two series' indexes. But a simple assignment doesn't work:

df = pd.DataFrame()
df['s1'] = s1
df['s2'] = s2

The resulting dataframe has the index from s1 but is missing the rows of s2 whose index labels are not in s1:

   s1    s2
1  10   NaN
2  20   NaN
3  30  35.0
4  40  45.0

Why is that? It seems somewhat counter-intuitive.

Note: a proposed solution is:

df = pd.concat((s1, s2), axis=1)

which gives the expected result:

     s1    s2
1  10.0   NaN
2  20.0   NaN
3  30.0  35.0
4  40.0  45.0
5   NaN  55.0
6   NaN  65.0

But I am nevertheless curious why a simple column assignment doesn't work.

It's because the assignment matches on index, and s2 starts at 3.


df = pd.DataFrame()
df['s1'] = s1
df['s2'] = s2

This sets the shape of df from s1, then matches the s2 data to that shape.


df with s1
   s1
1  10
2  20
3  30
4  40

df with s2
   s2
3  35
4  45
5  55
6  65

This is because when you assign a Series to a column of an empty dataframe, the dataframe's index is created to align with that Series.

When you then assign another Series to a new column of the one-column dataframe, only index labels already in the dataframe are considered.
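The alignment described above can be made visible with a minimal sketch. It assumes that column assignment behaves like reindexing the incoming Series to the dataframe's existing index; the variable name aligned is only for illustration.

```python
import pandas as pd

s1 = pd.Series({1: 10, 2: 20, 3: 30, 4: 40}, name='s1')
s2 = pd.Series({3: 35, 4: 45, 5: 55, 6: 65}, name='s2')

df = pd.DataFrame()
df['s1'] = s1                     # the empty frame adopts s1's index: 1, 2, 3, 4

# Conform s2 to the frame's current index by hand:
# labels 5 and 6 are dropped, labels 1 and 2 become NaN
aligned = s2.reindex(df.index)

df['s2'] = s2                     # plain column assignment
print(df['s2'].equals(aligned))   # compares the assigned column with the manual reindex
```

If the two agree, the "missing" rows are simply the labels of s2 that df.index does not contain.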

If you try to assign s2 before s1, the rows that exist only in s1 are the ones dropped:

df = pd.DataFrame()
df['s2'] = s2
df['s1'] = s1
print(df)

   s2    s1
3  35  30.0
4  45  40.0
5  55   NaN
6  65   NaN

Since the two series have different indices, assigning either one first (s1 before s2, or s2 before s1) to the empty dataframe still leaves you with missing rows. This is because the index of an empty dataframe is automatically set to that of the first series assigned to it. As a result, when the second series is assigned, the dataframe takes only the rows that align with its current index (the index of the first series) and ignores the labels of the second series that the first does not share.

There is one remedy that makes the two statements assigning s1 and s2 to df work as you expect:
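To see which labels are at risk in each order, one option is to compare the two indexes directly. This is only an illustrative sketch; the names only_in_s2 and only_in_s1 are not from the original answer.

```python
import pandas as pd

s1 = pd.Series({1: 10, 2: 20, 3: 30, 4: 40}, name='s1')
s2 = pd.Series({3: 35, 4: 45, 5: 55, 6: 65}, name='s2')

# Labels present in one series but absent from the other
only_in_s2 = s2.index.difference(s1.index)   # dropped when s1 is assigned first
only_in_s1 = s1.index.difference(s2.index)   # dropped when s2 is assigned first

print(list(only_in_s2))
print(list(only_in_s1))
```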

df = pd.DataFrame(index=s1.index.union(s2.index))

By presetting the dataframe index to the union of s1.index and s2.index, that is s1.index.union(s2.index), you get the desired result:

df['s1'] = s1
df['s2'] = s2


print(df)

     s1    s2
1  10.0   NaN
2  20.0   NaN
3  30.0  35.0
4  40.0  45.0
5   NaN  55.0
6   NaN  65.0

Breaking down the intermediate steps gives an interesting result:

df = pd.DataFrame(index=s1.index.union(s2.index))
df['s1'] = s1


print(df)

     s1
1  10.0
2  20.0
3  30.0
4  40.0
5   NaN
6   NaN   

Here, after assigning only s1 and before assigning s2, you can already see index labels 5 and 6 (which belong only to s2). Their values are NaN. This is because the empty dataframe df was defined with the union of the s1 and s2 indexes, while s2 has not yet been assigned to it.

If you instead want to modify the dataframe index on the fly, after s1 has been assigned to an empty dataframe that was created without the index= parameter, you can use df.reindex(), as follows:
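The intermediate state can also be checked programmatically. The sketch below assumes the same s1 and s2 as in the question and inspects the index, the NaN positions, and the column dtype after only s1 has been assigned.

```python
import pandas as pd

s1 = pd.Series({1: 10, 2: 20, 3: 30, 4: 40}, name='s1')
s2 = pd.Series({3: 35, 4: 45, 5: 55, 6: 65}, name='s2')

# Preset the index to the union of both series' labels
df = pd.DataFrame(index=s1.index.union(s2.index))
df['s1'] = s1

print(list(df.index))             # all six labels, even though s2 is not assigned yet
print(df['s1'].isna().tolist())   # True only for the labels that s1 lacks
print(df['s1'].dtype)             # integers are upcast because NaN needs a float column
```

The upcast explains why the values print as 10.0 rather than 10 in the output above.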

df = pd.DataFrame()                            # without the index= parameter
df['s1'] = s1
df = df.reindex(s1.index.union(s2.index))      # use reindex() to extend the index to the union
df['s2'] = s2                                  # s2 now aligns with the full index



print(df)

     s1    s2
1  10.0   NaN
2  20.0   NaN
3  30.0  35.0
4  40.0  45.0
5   NaN  55.0
6   NaN  65.0
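As a sanity check, the reindex route can be compared against the pd.concat solution from the question. This is a sketch under the assumption that s2 is assigned after the reindex, which is what the two-column output shown above requires; the name expected is only for illustration.

```python
import pandas as pd

s1 = pd.Series({1: 10, 2: 20, 3: 30, 4: 40}, name='s1')
s2 = pd.Series({3: 35, 4: 45, 5: 55, 6: 65}, name='s2')

df = pd.DataFrame()
df['s1'] = s1
df = df.reindex(s1.index.union(s2.index))   # extend the index to the union of labels
df['s2'] = s2                               # every label of s2 now has a row to land in

expected = pd.concat((s1, s2), axis=1)      # the solution proposed in the question
print(df.equals(expected))                  # compares values, dtypes, and labels
```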
