I've been struggling with this one and am feeling a bit stuck.
I have a dataframe named merged_frame that contains data like this
(it is a single frame, created by concatenating a handful of frames with the same shape):
   fqdn       source
0  site1.org  public_source_a
1  site2.org  public_source_a
2  site3.org  public_source_a
3  site1.org  public_source_b
4  site4.org  public_source_b
5  site1.org  public_source_b
6  site4.org  public_source_d ...
7  site1.org  public_source_c
...
What I am trying to do is create a new column in this frame that contains a list (ideally a Python list as opposed to a comma-delimited string) of the sources when grouping by the fqdn value. For example, the data produced for the fqdn value site1.org should look like this based on this example data (this is just a subset of what I would expect; there should also be rows for the other fqdn values):
fqdn       source_list                                          source
site1.org  [public_source_a, public_source_b, public_source_c]  public_source_a
site1.org  [public_source_a, public_source_b, public_source_c]  public_source_b
site1.org  [public_source_a, public_source_b, public_source_c]  public_source_c
site1.org  [public_source_a, public_source_b, public_source_c]  public_source_d
Once I have the data in this form, I will simply drop the source column and then use drop_duplicates(keep='first') to get rid of all but one.
I dug up some old code that I used to do something similar about 2 years ago and it is not working as I expected it to. It's been quite a while since I've done something like this with Pandas. What I had was along the lines of:
merged_frame['source_list'] = merged_frame.groupby(
'fqdn', as_index=False)[['source']].aggregate(
lambda x: list(x))['source']
This is behaving very strangely. While it does create source_list as a list/array, the data in the column is not correct. Additionally, quite a few fqdn values have a null/NaN value for source_list.
I have a feeling that I need to approach this completely differently. A little help would be appreciated; I'm completely blocked and am not making any progress, despite having what I thought were very relevant example blocks of code that I used on a similar dataset.
EDIT:
I have made a little progress by just starting with the fundamentals and have the following, though this joins the strings together rather than making them a list:
merged_frame['source_list'] = merged_frame.groupby('fqdn').source.transform(','.join)
I'm pretty sure that with a simple apply here I can split them back into a list. But what would be the correct way to do this in one shot, so that I don't need the unnecessary join followed by apply(split(','))?
Create the data frame from the example above:
import pandas as pd

df = pd.DataFrame({
    'fqdn': ['site1.org', 'site2.org', 'site3.org', 'site1.org',
             'site4.org', 'site1.org', 'site4.org', 'site1.org'],
    'source': ['public_source_a', 'public_source_a', 'public_source_a',
               'public_source_b', 'public_source_b', 'public_source_b',
               'public_source_d', 'public_source_c'],
})
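As an aside, the NaN values and mismatched lists described in the question are most likely an index-alignment effect. The sketch below is an assumption about the cause rather than a confirmed diagnosis of the original code: it builds a per-group result with a fresh 0..n-1 index (the same shape that as_index=False produces) and assigns it back to the frame. The name per_group is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "fqdn": ["site1.org", "site2.org", "site3.org", "site1.org",
             "site4.org", "site1.org", "site4.org", "site1.org"],
    "source": ["public_source_a", "public_source_a", "public_source_a",
               "public_source_b", "public_source_b", "public_source_b",
               "public_source_d", "public_source_c"],
})

# One row per fqdn, with a new RangeIndex 0..3 (four distinct fqdn values).
per_group = df.groupby("fqdn")["source"].agg(list).reset_index()

# Column assignment aligns on the index, not on fqdn: row i of df receives
# the list for the i-th group, and rows beyond the group count get NaN.
df["source_list"] = per_group["source"]

print(df[["fqdn", "source_list"]])
```

On this data, row 3 (site1.org) ends up with the list that belongs to site4.org, and rows 4 through 7 are NaN, which matches the symptoms in the question.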
Group by fqdn, take the unique sources in each group, and convert each result to a list:
df_grouped = df.groupby('fqdn')['source'].unique().apply(list).reset_index()
Merge with the original df and rename the columns (the merge suffixes the two clashing source columns with _x and _y):
result = pd.merge(df, df_grouped, on='fqdn', how='left')
result = result.rename(columns={'source_x': 'source', 'source_y': 'source_list'})
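If the goal is to attach the list in one step without a merge, one possible variation is to map each fqdn onto the per-group lists. This is a sketch under the assumption that unique sources in order of first appearance are wanted; the names lists and one_per_fqdn are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "fqdn": ["site1.org", "site2.org", "site3.org", "site1.org",
             "site4.org", "site1.org", "site4.org", "site1.org"],
    "source": ["public_source_a", "public_source_a", "public_source_a",
               "public_source_b", "public_source_b", "public_source_b",
               "public_source_d", "public_source_c"],
})

# Series indexed by fqdn whose values are Python lists of unique sources.
lists = df.groupby("fqdn")["source"].unique().apply(list)

# Look up each row's fqdn in that Series; this matches on the fqdn value,
# so every row gets the list for its own group.
df["source_list"] = df["fqdn"].map(lists)

# Lists are unhashable, so deduplicate on fqdn instead of on all columns.
one_per_fqdn = df.drop(columns="source").drop_duplicates(subset="fqdn")
print(one_per_fqdn)
```

Note that rows sharing an fqdn refer to the same list object after the map, so mutating one list in place would affect the others. If only one row per fqdn is needed in the end, the grouped Series itself (lists.reset_index()) already has that shape.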