
Sort pandas data frame based on values assigned in separate dictionary

I have Excel tables of biographical events ranging from birth to death. In between, there are all kinds of professional and personal events, such as university graduation, marriage, or retirement. Ideally, I would sort those by date for each person in the data set, but as these are historical data, we have no exact dates for most events.

My plan therefore is to assign each event a chronological value, from "1" for birth to "100" for death, and to rank the other events somewhere in between so that fuzzy timelines can be reconstructed.

For the event coding, I have created a dictionary:


event_value_dict={"Sonstiges":0,
                  "Geburt":1,
                  "Taufe":2,
                  "Primäre Bildungsstation":3,
                  "Privatunterricht":3,
                  "Rezeption":4, # not sure whether this relates to university studies?
                  "Zulassung":9, # before university, or e.g. also admission to an exam?
                  "Immatrikulation":10,
                  "Studium":11,
                  "Prüfungsverfahren":11,
                  "Graduation":12,
                  "Praktikum":13,
                  "Promotion":14,
                  "Wohnsitznahme": 20,
                  "Reise":20, # events with code "20" can occur several times in mid-life
                  "Nobilitierung":20,
                  "Aufnahme":20,
                  "Aufschwörung":20,
                  "Eheschließung":20,
                  "Funktionsausübung":20,
                  "erfolglose Bewerbung":20,
                  "Rejektion":20,
                  "Aufenthalt":20,
                  "mittelbare Nobilitierung":20,
                  " Privilegierung":20,
                  "Wappenbesserung":20,
                  "Introduktion":30, # relating to what?
                  "Mitgliedschaft":30,
                  "Gesandtschaft":30, # presumably not for very young persons?
                  "Präsentation":30, # not sure what this is...
                  "Vokation":39, # call to a university post?
                  "Ernennung":40,
                  "Amtseinführung":41,
                  "Vereidigung":41,
                  "Amtsantritt":42,
                  "Beförderung":44, # how often are people promoted on average?
                  "Ehrung":45, # presumably for people from mid-life onwards?
                  "Entlassung":50,
                  "Suspendierung":50,
                  "Absetzung":50,
                  "Resignation":50,
                  "Rücktritt":50,
                  "Pensionierung":90,
                  "Pension":91,
                  "Tod":100}

Is there a way to apply these values to a data frame when sorting by column without adding a new column or overwriting the old one? As we refine our research, these values will most likely change and new ones might be added, so I am looking for the most flexible system possible.

My current script correctly identifies the unique names in the data set and gives me all the events connected with each person ( res_df ). For sorting, I am currently using:

res_sorted=res_df.sort_values(by =['event_type', 'event_start']) # sort by event, then date

At the moment, the event names are sorted alphabetically. How can I (temporarily) sort by the values from the dictionary above instead?

I think that the map function might be a good solution (cf. https://kanoki.org/2019/04/06/pandas-map-dictionary-values-with-dataframe-columns/ ), but perhaps users here have other ideas.
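One way to sort by the dictionary values without storing them in the frame is the key argument of sort_values (available in pandas 1.1 and later). The following is only a minimal sketch under assumptions: the dictionary is a shortened excerpt, the sample rows are made up, and event_sort_key is a hypothetical helper name. The key function is called once per column listed in by, so it checks the column name and passes every other column through unchanged.

```python
import pandas as pd

# Shortened, hypothetical excerpt of the dictionary from the question.
event_value_dict = {"Geburt": 1, "Studium": 11, "Eheschließung": 20, "Tod": 100}

# Made-up sample rows; the column names follow the question.
res_df = pd.DataFrame({
    "pers_name": ["A", "A", "A", "A"],
    "event_type": ["Tod", "Studium", "Geburt", "Eheschließung"],
    "event_start": ["1700", "1650", "1630", "1660"],
})

def event_sort_key(col: pd.Series) -> pd.Series:
    """Map event names to their rank; return all other columns unchanged."""
    if col.name == "event_type":
        return col.map(event_value_dict)
    return col

# The mapped values are used for ordering only; no column is added or overwritten.
res_sorted = res_df.sort_values(by=["event_type", "event_start"], key=event_sort_key)
print(res_sorted["event_type"].tolist())
```

Event names that are missing from the dictionary map to NaN and, with the default na_position, end up at the bottom.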

Maybe it's a naive idea, but I would duplicate your event_type column and use your dictionary to replace the event names in the copy with the numbers:

res_df["event_type_duplicated_column"] = res_df["event_type"]
res_df = res_df.replace({"event_type_duplicated_column": event_value_dict})

Then sort by this new column with the replaced values.
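If the extra column should only exist while sorting, it can be created and dropped in one chain. This is a sketch, not the answer's exact code: it uses map instead of replace, so that names missing from the dictionary become NaN rather than staying as strings, and the helper column name _event_rank and the sample rows are made up.

```python
import pandas as pd

# Shortened, hypothetical excerpt of the dictionary from the question.
event_value_dict = {"Geburt": 1, "Studium": 11, "Tod": 100}

# Made-up rows for illustration.
res_df = pd.DataFrame({
    "event_type": ["Tod", "Geburt", "Studium"],
    "event_start": ["1701", "1640", "1662"],
})

res_sorted = (
    res_df
    .assign(_event_rank=res_df["event_type"].map(event_value_dict))  # temporary helper column
    .sort_values(by=["_event_rank", "event_start"])
    .drop(columns="_event_rank")  # remove the helper again after sorting
)
print(res_sorted)
```

Because assign returns a new frame, res_df itself never gets the helper column.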

Assuming the dict is ordered as it is now...

I really like the answer found here.

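import numpy as np

# Turn the event names into an ordered categorical that follows the order of the dictionary keys
df['event_type'] = df['event_type'].astype('category')
# Event names that occur in the data but not in the dictionary are appended at the end
not_in_list = df['event_type'].cat.categories.difference(list(event_value_dict))
df['event_type'] = df['event_type'].cat.set_categories(np.hstack((list(event_value_dict), not_in_list)), ordered=True)

df = df.sort_values('event_type')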
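The categorical approach depends on the order in which the keys were written into the dictionary. A possible variant, sketched here with made-up data and a deliberately shuffled dictionary, derives the category order from the dictionary values instead. The variable names known and unknown are illustrative only.

```python
import pandas as pd

# Hypothetical excerpt, deliberately not written in chronological order.
event_value_dict = {"Tod": 100, "Reise": 20, "Geburt": 1, "Eheschließung": 20}

# Made-up rows; "Unbekannt" stands for an event type missing from the dictionary.
df = pd.DataFrame({"event_type": ["Tod", "Eheschließung", "Unbekannt", "Reise", "Geburt"]})

# Order the known names by their assigned value. sorted() is stable,
# so names with equal values keep their relative dictionary order.
known = sorted(event_value_dict, key=event_value_dict.get)
# Names found in the data but not in the dictionary go to the end.
unknown = [name for name in df["event_type"].unique() if name not in event_value_dict]

event_dtype = pd.CategoricalDtype(categories=known + unknown, ordered=True)
df["event_type"] = df["event_type"].astype(event_dtype)
df = df.sort_values("event_type")
print(df["event_type"].tolist())
```

Note that an ordered categorical imposes a strict order even between events that share a value (here "Reise" and "Eheschließung"), so a second sort column such as a date will not break ties between them.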

This is what my complete script using the map solution looks like:


import pandas as pd

# Paths of locally synchronised EXCEL files
# all project members can adjust paths according to their own file structure

factoids='C:\\Users\\#####\\####.xlsx'

# classify events

#f=pd.read_excel(factoids)
#events_f=(f[['event_type']])
#search_event=events_f.drop_duplicates()
#print(search_event)

event_value_dict={"Sonstiges":0,
                  "Geburt":1,
                  "Taufe":2,
                  "Primäre Bildungsstation":3,
                  "Privatunterricht":3,
                  "Rezeption":4, # not sure whether this relates to university studies?
                  "Zulassung":9, # before university, or e.g. also admission to an exam?
                  "Immatrikulation":10,
                  "Studium":11,
                  "Prüfungsverfahren":11,
                  "Graduation":12,
                  "Praktikum":13,
                  "Promotion":14,
                  "Wohnsitznahme": 20,
                  "Reise":20, # events with code "20" can occur several times in mid-life
                  "Nobilitierung":20,
                  "Aufnahme":20,
                  "Aufschwörung":20,
                  "Eheschließung":20,
                  "Funktionsausübung":20,
                  "erfolglose Bewerbung":20,
                  "Rejektion":20,
                  "Aufenthalt":20,
                  "mittelbare Nobilitierung":20,
                  " Privilegierung":20,
                  "Wappenbesserung":20,
                  "Introduktion":30, # relating to what?
                  "Mitgliedschaft":30,
                  "Gesandtschaft":30, # presumably not for very young persons?
                  "Präsentation":30, # not sure what this is...
                  "Vokation":39, # call to a university post?
                  "Ernennung":40,
                  "Amtseinführung":41,
                  "Vereidigung":41,
                  "Amtsantritt":42,
                  "Beförderung":44, # how often are people promoted on average?
                  "Ehrung":45, # presumably for people from mid-life onwards?
                  "Entlassung":50,
                  "Suspendierung":50,
                  "Absetzung":50,
                  "Resignation":50,
                  "Rücktritt":50,
                  "Pensionierung":90,
                  "Pension":91,
                  "Tod":100}

# read person list

f=pd.read_excel(factoids)
pers_name_f=(f[['pers_name']]) 
search_df=pers_name_f.drop_duplicates() # remove duplicates
search_list=search_df['pers_name'].tolist()

# count no. of entries in flattened person list

no_person=len(search_list)
print("There are", no_person, "unique person names in this data set.")

# add event values from dict to data frame

f['event_value'] = f['event_type'].map(event_value_dict)

# iterate through unique persons to get their events

for name in search_list:
    print(name)
    res_df=(f.loc[f['pers_name'] == name])
    res_sorted=res_df.sort_values(by=['event_value','event_after-date','event_start','event_before-date'])

    # write results to new sheets in EXCEL (with mode='a' the target file must already exist)

    with pd.ExcelWriter('C:\\Users\\######\\#######.xlsx', engine='openpyxl', mode='a') as writer:
        res_sorted.to_excel(writer, sheet_name=name[-14:], index=True)

print("Done.")
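As the dictionary is expected to change, it may help to list event types that have no value yet, since map silently turns them into NaN. The sketch below also groups by person instead of filtering once per name and keeps the Excel output in a separate function that opens the writer only once. Everything in it is an assumption for illustration: the sample rows, the shortened dictionary, the reduced list of sort columns, and the helper name write_sheets. Unlike the script above, write_sheets creates a new file rather than appending to an existing one.

```python
import pandas as pd

# Shortened, hypothetical excerpt of the dictionary.
event_value_dict = {"Geburt": 1, "Studium": 11, "Tod": 100}

# Made-up rows; "Erbschaft" stands for an event type that has no value yet.
f = pd.DataFrame({
    "pers_name": ["Müller", "Müller", "Schmidt", "Schmidt", "Schmidt"],
    "event_type": ["Tod", "Geburt", "Studium", "Geburt", "Erbschaft"],
    "event_start": ["1701", "1640", "1675", "1655", "1680"],
})

f["event_value"] = f["event_type"].map(event_value_dict)

# Event types missing from the dictionary were mapped to NaN; report them.
unmapped = sorted(f.loc[f["event_value"].isna(), "event_type"].unique())
print("Event types without a value:", unmapped)

# One sorted frame per person; rows with NaN values are placed last.
sorted_by_person = {
    name: group.sort_values(by=["event_value", "event_start"])
    for name, group in f.groupby("pers_name", sort=False)
}

def write_sheets(frames, path):
    """Write one sheet per person, opening the workbook only once (requires openpyxl)."""
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        for name, frame in frames.items():
            frame.to_excel(writer, sheet_name=name[-14:], index=True)
```

Calling write_sheets(sorted_by_person, some_path) would then produce the workbook; the path is left to the caller.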


 