Pandas DataFrame: create a new list column by aggregating strings from a grouped column

I've been struggling with this one a bit and am feeling a bit stuck.

I have a dataframe consisting of data like this, named merged_frame (it is a single frame, created by concatenating a handful of frames with the same shape):

          fqdn               source
0         site1.org          public_source_a
1         site2.org          public_source_a
2         site3.org          public_source_a
3         site1.org          public_source_b
4         site4.org          public_source_b
5         site1.org          public_source_b
6         site4.org          public_source_d
7         site1.org          public_source_c
...

What I am trying to do is create a new column in this frame that contains a list (ideally a Python list as opposed to a comma-delimited string) of the sources when grouping by the fqdn value. For example, based on the sample data above, the rows produced for the fqdn value site1.org should look like this (this is just a subset of what I would expect; there should also be rows for the other fqdn values):

fqdn        source_list                                           source
site1.org   [public_source_a, public_source_b, public_source_c]   public_source_a
site1.org   [public_source_a, public_source_b, public_source_c]   public_source_b
site1.org   [public_source_a, public_source_b, public_source_c]   public_source_b
site1.org   [public_source_a, public_source_b, public_source_c]   public_source_c

Once I have the data in this form, I will simply drop the source column and then use drop_duplicates(keep='first') to get rid of all but one.

I dug up some old code that I used to do something similar about 2 years ago and it is not working as I expected it to. It's been quite a while since I've done something like this with Pandas. What I had was along the lines of:

    merged_frame['source_list'] = merged_frame.groupby(
        'fqdn', as_index=False)[['source']].aggregate(
            lambda x: list(x))['source']

This is behaving very strangely. While it does create source_list as a list/array, the data in the column is not correct. Additionally, quite a few fqdn values have a null/NaN value for source_list.

I have a feeling that I need to approach this completely differently. A little help with this would be appreciated. I'm completely blocked and am not making any progress, despite having what I thought were very relevant example blocks of code that I used on a similar dataset.

EDIT:

I have made a little progress by just starting with the fundamentals and have the following, though this joins the strings together rather than making them a list:

    merged_frame['source_list'] = merged_frame.groupby('fqdn').source.transform(','.join)

I'm pretty sure that with a simple apply here I can split them back into a list. But what would be the correct way to do this in one shot, so that I don't need to do the unnecessary join and then apply(split(','))?
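
A likely explanation for the NaN values and the wrong lists is index alignment. The grouped result has one row per fqdn with a fresh 0..n-1 index, and assigning it back as a column matches rows by index label rather than by fqdn. The sketch below reproduces the effect on the sample data; it uses `apply(list)` followed by `reset_index()` as a simplified stand-in for the original `as_index=False` aggregate, and the names `per_group` and `misaligned` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "fqdn": ["site1.org", "site2.org", "site3.org", "site1.org",
             "site4.org", "site1.org", "site4.org", "site1.org"],
    "source": ["public_source_a", "public_source_a", "public_source_a",
               "public_source_b", "public_source_b", "public_source_b",
               "public_source_d", "public_source_c"],
})

# One row per distinct fqdn (four here), indexed 0..3 after reset_index().
per_group = df.groupby("fqdn")["source"].apply(list).reset_index()

# Column assignment aligns on the index, not on fqdn: rows 0..3 receive
# the lists of groups 0..3, and rows 4..7 have no match, so they get NaN.
misaligned = df.assign(source_list=per_group["source"])

print(misaligned)
```

With this data, row 3 is a site1.org row but ends up carrying the list that belongs to site4.org, which matches the "data in the column is not correct" symptom.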

Create the data frame from the example above:

    import pandas as pd

    df = pd.DataFrame({
        'fqdn': ['site1.org', 'site2.org', 'site3.org', 'site1.org', 'site4.org', 'site1.org', 'site4.org', 'site1.org'],
        'source': ['public_source_a', 'public_source_a', 'public_source_a', 'public_source_b',
                   'public_source_b', 'public_source_b', 'public_source_d', 'public_source_c'],
    })

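Use groupby with unique() and apply(list) to get one list of distinct sources per fqdn:

    # unique() drops repeated sources within each group; apply(list) turns each array into a Python list
    df_grouped = df.groupby('fqdn')['source'].unique().apply(list).reset_index()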

Merge with the original df and rename the columns (both frames have a source column, so the merge suffixes them with _x and _y):

    result = pd.merge(df, df_grouped, on='fqdn', how='left')
    result.rename(columns={'source_x': 'source', 'source_y': 'source_list'}, inplace=True)
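
As a possible one-shot alternative to the merge, the per-fqdn lists can be mapped straight back onto the rows. This is only a sketch on the same sample data; `sources_by_fqdn` and `deduped` are illustrative names, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "fqdn": ["site1.org", "site2.org", "site3.org", "site1.org",
             "site4.org", "site1.org", "site4.org", "site1.org"],
    "source": ["public_source_a", "public_source_a", "public_source_a",
               "public_source_b", "public_source_b", "public_source_b",
               "public_source_d", "public_source_c"],
})

# Series indexed by fqdn; each value is a Python list of distinct sources
# in order of first appearance.
sources_by_fqdn = df.groupby("fqdn")["source"].unique().apply(list)

# map() looks up each row's fqdn in that index, so no merge or rename is needed.
df["source_list"] = df["fqdn"].map(sources_by_fqdn)

# If the end goal is one row per fqdn, the grouped Series already has that shape.
deduped = sources_by_fqdn.rename("source_list").reset_index()

print(df)
print(deduped)
```

Taking `deduped` directly from the grouped result also sidesteps the planned drop_duplicates step, which is likely to raise a TypeError on a column of lists because lists are not hashable.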
