
Joining two dataframes then combining data in fields with same name using Pandas

I have seven dataframes with hundreds of rows each (don't ask) that I need to combine on a column. I know how to use the inner join functionality in Pandas. What I need help with is that in some cases these seven dataframes have columns with the same names. In those cases, I would like to combine the data in those columns, delimited with a semicolon.

For example, if Row 1 in DF1 through DF7 has the same identifier, I would like Col1 from each dataframe (given that the columns have the same name) to be combined to read:

dfdata1; dfdata2; ...; dfdata7

In cases where a column name is unique to one dataframe, I'd like that column to appear in the final combined dataframe as well.

I've included a simple example:

import pandas as pd

data1 = pd.DataFrame([['Banana', 'Sally', 'CA'], ['Apple', 'Gretta', 'MN'], ['Orange', 'Samantha', 'NV']],
                     columns=['Product', 'Cashier', 'State'])
  

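# Column names for data2 are assumed from the output shown in the answer
# (the original line is truncated); short rows are padded with None.
data2 = pd.DataFrame([['Shirt', '', 'CA', None], ['Shoe', 'Trish', 'MN', None], ['Socks', 'Paula', 'NM', 'Hourly']],
                     columns=['Product', 'Cashier', 'State', 'Type'])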

This yields two dataframes:

[image: data1 and data2 displayed as tables]

If we were to use an outer merge on State:

pd.merge(data1,data2,on='State',how='outer')

[image: result of the outer merge]
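As a rough illustration of why the plain merge falls short, the sketch below runs the same outer merge and lists the resulting columns. The definition of data2 is an assumption (its column names are taken from the output shown in the answer). Columns that share a name are kept apart with pandas' default `_x` / `_y` suffixes rather than being combined.

```python
import pandas as pd

data1 = pd.DataFrame(
    [['Banana', 'Sally', 'CA'], ['Apple', 'Gretta', 'MN'], ['Orange', 'Samantha', 'NV']],
    columns=['Product', 'Cashier', 'State'],
)
# Assumed reconstruction of data2; 'Type' is only filled in for the last row.
data2 = pd.DataFrame(
    [['Shirt', '', 'CA', None], ['Shoe', 'Trish', 'MN', None], ['Socks', 'Paula', 'NM', 'Hourly']],
    columns=['Product', 'Cashier', 'State', 'Type'],
)

merged = pd.merge(data1, data2, on='State', how='outer')

# Same-named, non-key columns are suffixed instead of combined:
# Product_x / Product_y and Cashier_x / Cashier_y.
print(merged.columns.tolist())
```

With seven frames, repeated merges would keep producing new suffixed columns that then have to be joined by hand, which is the extra work the question is trying to avoid.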

What I want is something more like this:

[image: desired combined dataframe]

Is this doable in pandas, or will I have to merge the first two, combine the columns with the same names, then move on to combine that result with the third one, and so on? I'm trying to be as efficient as possible.

Instead of merging, concatenate:

# concatenate and groupby to join the strings
df = pd.concat([data1, data2]).groupby('State', as_index=False).agg(lambda x: '; '.join(el for el in x if pd.notna(el)))
print(df)
  State        Product        Cashier    Type
0    CA  Banana; Shirt        Sally;         
1    MN    Apple; Shoe  Gretta; Trish        
2    NM          Socks          Paula  Hourly
3    NV         Orange       Samantha        
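The same idea should extend to all seven dataframes by passing a longer list to `pd.concat`. The sketch below wraps it in a helper. The name `combine_frames`, the third frame `data3`, and the choice to skip empty strings (which avoids results like `Sally; ` above) are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

def combine_frames(frames, key, sep='; '):
    """Stack frames and join same-named columns per key value (hypothetical helper)."""
    def join_values(values):
        # Skip missing values and empty strings so no dangling separator is left.
        return sep.join(str(v) for v in values if pd.notna(v) and v != '')

    stacked = pd.concat(frames, ignore_index=True)
    return stacked.groupby(key, as_index=False).agg(join_values)

data1 = pd.DataFrame(
    [['Banana', 'Sally', 'CA'], ['Apple', 'Gretta', 'MN'], ['Orange', 'Samantha', 'NV']],
    columns=['Product', 'Cashier', 'State'],
)
# Assumed reconstruction of data2 (column names taken from the output above).
data2 = pd.DataFrame(
    [['Shirt', '', 'CA', None], ['Shoe', 'Trish', 'MN', None], ['Socks', 'Paula', 'NM', 'Hourly']],
    columns=['Product', 'Cashier', 'State', 'Type'],
)
# Made-up third frame, standing in for the remaining dataframes.
data3 = pd.DataFrame([['Hat', 'Omar', 'CA']], columns=['Product', 'Cashier', 'State'])

result = combine_frames([data1, data2, data3], key='State')
print(result)
```

Columns that exist in only one frame (here `Type`) still appear in the result, with an empty string for keys that have no value.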


 