
Python join two dataframes with different sizes

I'm trying to join two different dataframes. I'll explain what I've done so far so you can see what I have tried. I'm fairly new to Python and I'd really appreciate any hint on where I can improve my code.

I've got a dataset which looks similar to this:

cluster, Type
      1,    M
      1,    T
      1,    M

I've grouped the data and done some aggregation. In addition, I added some columns to the dataset, so my dataframe now looks like this:

>>> df
cluster, Type, M, T
      1,    M, 0, 0
      1,    T, 0, 0
      1,    M, 0, 0

And the aggregation looks like this:

>>> a
cluster, Type, len
      1,    M,   2
      1,    T,   1

I want to put every len from a into the corresponding column in df, so the result would be:

>>> df
cluster, Type, M, T
      1,    M, 2, 0
      1,    T, 0, 1

What I've tried to do is:

for idx, row in df.iterrows():
    c = row['cluster']
    t = row['Type']
    val = a.loc[
        (a['cluster'] == c) &
        (a['Type'] == t),
        'len'
    ]
    row[t] = val

In the end it failed: the last line, row[t] = val, did not update df. I also have the feeling that I'm doing this in a very complicated way.

Any ideas how to do it in a more elegant way?
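On why the loop in the question has no effect: iterrows yields each row as a separate Series, so assigning to row[t] generally does not write back to df. A minimal sketch of a loop that does write back, assuming df and a are built as shown in the question (the sample frames below are reconstructed from it), is to assign through df.loc and take the scalar out of the one-element selection:

```python
import pandas as pd

# Sample frames reconstructed from the question (assumed, not the real data).
df = pd.DataFrame({"cluster": [1, 1, 1], "Type": ["M", "T", "M"], "M": 0, "T": 0})
a = pd.DataFrame({"cluster": [1, 1], "Type": ["M", "T"], "len": [2, 1]})

for idx, row in df.iterrows():
    # Select the len for this (cluster, Type) pair; the result is a Series.
    match = a.loc[(a["cluster"] == row["cluster"]) & (a["Type"] == row["Type"]), "len"]
    if not match.empty:
        # Write to df itself, not to the row object handed out by iterrows.
        df.loc[idx, row["Type"]] = match.iloc[0]

# Collapse the repeated rows to get one row per (cluster, Type).
result = df.drop_duplicates().reset_index(drop=True)
```

This keeps the original approach but is still row by row, so it will be slow on large frames.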

You can use this to go from 'a' to your expected result using set_index, unstack and reset_index:

df = a.set_index([a.Type,'cluster','Type'])['len']\
      .unstack(0).rename_axis(None,axis=1)\
      .reset_index()

Output (cells with no matching len come out as NaN rather than 0):

   cluster Type    M    T
0        1    M  2.0  NaN
1        1    T  NaN  1.0
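If zeros are preferred over NaN, as in the expected result, unstack accepts a fill_value argument. A minimal sketch under that assumption, with a rebuilt from the question; the level name col is only a hypothetical helper name used here to avoid having two index levels both called Type:

```python
import pandas as pd

# Aggregated frame reconstructed from the question (assumed sample data).
a = pd.DataFrame({"cluster": [1, 1], "Type": ["M", "T"], "len": [2, 1]})

wide = (
    # Keep cluster and Type as row keys, and add a copy of Type ("col")
    # that will be pivoted into the column headers.
    a.set_index(["cluster", "Type", a["Type"].rename("col")])["len"]
     .unstack("col", fill_value=0)   # missing combinations become 0, not NaN
     .rename_axis(columns=None)      # drop the leftover "col" header name
     .reset_index()
)
```

Here wide has the columns cluster, Type, M and T, with one row per (cluster, Type) pair.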

Here is another way to do it. It still involves a loop, but I think it's clearer and faster than what you were trying to do. It only uses your original df, so there is no need for the aggregation you provided.

Start by making a dictionary of the length per Type:

>>> len_dict = df.groupby('Type').size().to_dict()
>>> len_dict
{'M': 2, 'T': 1}

Then drop the duplicates in your original df, and finally loop through the keys in len_dict, assigning each count to the column of the same name on the rows of that Type:

df.drop_duplicates(inplace=True)

for t in len_dict:
    df.loc[df.Type.eq(t), t] = len_dict[t]

>>> df
   cluster Type  M  T
0        1    M  2  0
1        1    T  0  1
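Note that grouping by Type alone counts across all clusters, which is fine for the single-cluster sample above. A hedged sketch of a variant that counts per (cluster, Type) pair instead, assuming the data may contain several clusters (the extra cluster 2 row below is made up for illustration):

```python
import pandas as pd

# Sample data; the cluster 2 row is an invented addition for illustration.
df = pd.DataFrame({"cluster": [1, 1, 1, 2], "Type": ["M", "T", "M", "M"]})

# One row per (cluster, Type) pair with its count in a "len" column.
counts = df.groupby(["cluster", "Type"]).size().rename("len").reset_index()

# For every Type, create a column holding len on matching rows and 0 elsewhere.
for t in counts["Type"].unique():
    counts[t] = counts["len"].where(counts["Type"].eq(t), 0)

result = counts.drop(columns="len")
```

The loop runs once per distinct Type rather than once per row, so it stays cheap as the number of rows grows.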
