Combining two columns from two dataframes; same indices but different lengths

Please be advised, I am a beginning programmer and a beginning Python/pandas user. I'm a behavioral scientist learning to use pandas to process and organize my data. As a result, some of this might seem completely obvious, and it may seem like a question not worthy of the forum. Please have tolerance! To me, this is days of work, and I have indeed spent hours trying to figure out the answer to this question already. Thanks in advance for any help.

My data look like this. The "real" Actor and Recipient data are always 5-digit numbers, and the "Behavior" data are always letter codes. My problem is that I also use this format for special lines, denoted by markers like "date" or "s" in the Actor column. These markers indicate that the "Behavior" column holds this special type of data, and not actual Behavior data. So, I want to replace the markers in the Actor column with NaN values, and grab the special data from the Behavior column to put in another column (in this example, the empty Activity column).

    follow    Activity    Actor    Behavior    Recipient1
0   1         NaN         date     2.1.3.2012  NaN
1   1         NaN         s        ss.hx       NaN
2   1         NaN         50505    vo          51608
3   1         NaN         51608    vr          50505
4   1         NaN         s        ss.he       NaN

So far, I have written some code in pandas to select out the "s" lines into a new dataframe:

def get_act_line(group):
    # .loc is used here because the .ix indexer was removed in pandas 1.0
    return group.loc[group.Actor == 's']

result = trimdata.groupby('follow').apply(get_act_line)

I've copied over the Behavior column in this dataframe to the Activity column, and replaced the Actor and Behavior values with NaN:

result.Activity = result.Behavior
result.Behavior = np.nan
result.Actor = np.nan
result.head()

So my new dataframe looks like this:

follow         follow    Activity    Actor    Behavior    Recipient1
1        2     1         ss.hx       NaN      NaN         NaN
         34    1         hf.xa       NaN      NaN         f.53702
         74    1         hf.fe       NaN      NaN         NaN
10       1287  10        ss.hf       NaN      NaN         db
         1335  10        fe          NaN      NaN         db

What I would like to do now is to combine this dataframe with the original, replacing all of the values in these selected rows, but maintaining values for the other rows in the original dataframe.

This may seem like a simple question with an obvious solution, or perhaps I have gone about it all wrong to begin with!

I've worked through Wes McKinney's book, and I've read the documentation on different types of merges, mapping, joining, transformations, concatenations, etc. I have browsed the forums and have not found an answer that helps me to figure this out. Your help will be very much appreciated.

One way you can do this (though there may be more optimal or elegant ways) is:

import numpy as np

mask = (df['Actor'] == 's')
df['Activity'] = df[mask]['Behavior']
# .loc is used here because the .ix indexer was removed in pandas 1.0
df.loc[mask, 'Behavior'] = np.nan

where df is equivalent to your results dataframe. This should return (my column orders are slightly different):

  Activity  Actor             Behavior  Recipient1  follow
0      NaN   date  2013-04-01 00:00:00          NaN       1
1    ss.hx    NaN                ss.hx          NaN       1
2      NaN  50505                   vo        51608       1
3      NaN  51608                   vr        50505       1
4    ss.he    NaN                ss.hx          NaN       1
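
For anyone who wants to try the mask approach end to end, here is a minimal sketch on the five sample rows from the question. It assumes the Actor column is stored as strings (because it mixes IDs with markers like "s"), and it blanks both Actor and Behavior on the marker rows, which is what the question asks for. The frame below is rebuilt by hand for illustration and is not the asker's real data.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; values are strings because the
# Actor column mixes 5-digit IDs with markers such as "date" and "s".
df = pd.DataFrame({
    "follow": [1, 1, 1, 1, 1],
    "Activity": [np.nan] * 5,
    "Actor": ["date", "s", "50505", "51608", "s"],
    "Behavior": ["2.1.3.2012", "ss.hx", "vo", "vr", "ss.he"],
    "Recipient1": [np.nan, np.nan, "51608", "50505", np.nan],
})

# True for the rows whose Actor value is the "s" marker.
mask = df["Actor"] == "s"

# Keep Behavior only on the marker rows; every other row becomes NaN.
df["Activity"] = df["Behavior"].where(mask)

# Blank out the marker and the value that was just moved.
df.loc[mask, ["Actor", "Behavior"]] = np.nan

print(df)
```

Because everything happens on the original frame, no separate merge step is needed. The "date" rows could presumably be handled the same way with a second mask and a different target column.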

References:

  • Explanation of df.ix from other STO post.
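
If you would rather keep the two-dataframe workflow from the question, the edited rows can be written back by row label. This is only a sketch under assumptions: the names `original`, `selected`, and `cols` are made up for illustration, and the selection is done with a plain boolean mask so the row labels of the original frame are preserved.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the original data (not the asker's real file).
original = pd.DataFrame({
    "follow": [1, 1, 1, 1, 1],
    "Activity": pd.Series([np.nan] * 5, dtype="object"),
    "Actor": ["date", "s", "50505", "51608", "s"],
    "Behavior": ["2.1.3.2012", "ss.hx", "vo", "vr", "ss.he"],
})

# Select the marker rows into a separate frame; the row labels (1 and 4)
# are kept, which is what makes the write-back below possible.
selected = original.loc[original["Actor"] == "s"].copy()
selected["Activity"] = selected["Behavior"]
selected[["Actor", "Behavior"]] = np.nan

# Write the edited rows back; pandas aligns on row labels and column names,
# so rows that are not in `selected` are left untouched.
cols = ["Activity", "Actor", "Behavior"]
original.loc[selected.index, cols] = selected[cols]

print(original)
```

Two caveats. If the selection comes from groupby('follow').apply(...) as in the question, the result carries an extra follow index level (visible in the question's output), which would probably need to be dropped first, for example with droplevel(0). Also, DataFrame.update looks like a natural fit here, but it only copies non-NA values, so it would not blank out the Actor and Behavior cells.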
