简体   繁体   中英

Arranging a dataframe into a list of dictionaries based on columns values

This is a follow-up to a question I previously asked regarding how to create dictionaries with specific columns however I realise now I need to be a bit more specifc in what I am asking. Take the below to be dataset.

     Name  Notes  Source            Number  Info
0    Bob    NaN     NaN              g:45   NaN
1  Billy    1.0    Home             B:+67   NaN
2  Billy    1.0    Work               B:3   NaN
3  Billy    NaN     NaN  hhtps://uishiufb   NaN
4  Billy    0.0  School             V9:67   NaN
5   Eric    0.0     NaN             R:+35   NaN
6   Eric    NaN    Home            f-g:35   NaN

I need to create a list of dictionaries based on the actual Numbers in the number column. The data is messy so take, for example, Billy. Rows 1 and 4 have the same numbers but different labels, 'B:+' and 'V9' respectively. I would like the code to ignore this and only focus on the numbers and if the numbers are the same create a unique dictionary based on that information as you can see below. Also, the data could contain something like a URL ie completely different and id like it to store that as a separate dictionary. Note - it doesn't need to necessarily be a URL it's just an example because the URL could contain numbers which could confuse the issue. The dataset is extensive and so the code to do this would ideally apply to all the rows in the Name column.

It isn't demonstrated here but some cells in the initial dataset go onto a newline so once I order the data frame into dictionaries and output it as an excel sheet, there are '\n's in the data which I would like to replace with commas. Any ideas for that would be very helpful.

Output:

dict1 = {Name: Bob, Source:[Nan], Number: [g:45]}
dict2 = {Name: Billy, Source:[Home, School], Number: [B:+67,V9:67]}
dict3 = {Name: Billy, Source:[Work], Number: [B:3}
dict4 = {Name: Billy, Source:[Nan], Number: [hhtps://uishiufb]}
dict5 = {Name: Eric, Source:[Home], Number: [R:+35, f-g:35]}

Thank you very much in advance

How does this work for you?

num = df['Number'].str.extract(':\+?(\d+)').fillna(df['Number']).squeeze()
df.groupby(['Name', num])[['Source', 'Number']].agg(list)\
  .droplevel(1).reset_index().to_dict('records')

Output:

[{'Name': 'Billy', 'Source': ['Work'], 'Number': ['B:3']},
 {'Name': 'Billy', 'Source': ['Home', 'School'], 'Number': ['B:+67', 'V9:67']},
 {'Name': 'Billy', 'Source': [nan], 'Number': ['hhtps://uishiufb']},
 {'Name': 'Bob', 'Source': [nan], 'Number': ['g:45']},
 {'Name': 'Eric', 'Source': [nan, 'Home'], 'Number': ['R:+35', 'f-g:35']}]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM