
Populating several values in an empty column in a row, based on a value from another column

I already have an idea as to how I'm going to do this - I'm just curious about whether my method is the most efficient.

So for instance, let's say that for whatever reason, I have the following table: [image]

The first 4 columns in the table are all repeated - they just say info about the employee. The reason these rows repeat is because that employee handles multiple clients.

In some cases, I am missing the Age and Employment Duration of an employee. Another colleague gave me this information in an Excel sheet.

So now, I have info on Brian's and Dennis' age and employment duration, and I need to fill all rows with their employee IDs based on the information. My plan for doing that is this:

data = {"14":  # Brian's Employee ID
            {"Age": 31,
             "Employment Duration": 3},
        "21":  # Dennis' Employee ID
            {"Age": 45,
             "Employment Duration": 12}
        }

After making the above dictionary of dictionaries with the necessary values, my plan is to iterate over each row in the above dataframe, and fill in the 'Age' and 'Employment Duration' columns based on the value in 'Employee ID':

for index, row in df.iterrows():
    emp_id = row["Employee ID"]
    if emp_id in data:
        # Write through df.at: assigning to `row` would only change a copy
        df.at[index, "Age"] = data[emp_id]["Age"]
        df.at[index, "Employment Duration"] = data[emp_id]["Employment Duration"]

That's my plan for populating the missing values!

I'm curious about whether there's a simpler way that's just not presenting itself to me, because this was the first thing that sprang to mind!

Don't iterate over rows in pandas when you can avoid it. Instead, take advantage of what the pandas library provides, with operations like these:

Assume we have a dataframe:

data = pd.DataFrame({
    'name' : ['john', 'john', 'mary', 'mary'],
    'age'  : ['', '', 25, 25]
})

Which looks like:

   name age
0  john    
1  john    
2  mary  25
3  mary  25

We can apply a lambda function like so:

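# Use x['name'] here: inside apply(axis=1), x.name is the row's index label, not the 'name' column
data['age'] = data.apply(lambda x: 27 if x['name'] == 'john' else x['age'], axis=1)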

Or we can use pandas .loc:

# A single .loc call with a row mask and a column label avoids chained assignment
data.loc[data['name'] == 'john', 'age'] = 27

Test them out and compare how long each takes to execute vs. iterating over rows.
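One way to run that comparison is with the standard library's timeit module. This is only a rough sketch: the helper names (make_frame, with_iterrows, with_apply, with_loc) and the 1000-row test frame are made up for illustration, and the missing ages are stored as NaN rather than as empty strings. Actual timings will depend on your data and pandas version.

```python
import timeit

import pandas as pd


def make_frame(n=1000):
    # Hypothetical test data: n rows alternating between two names,
    # with john's age missing (NaN)
    return pd.DataFrame({
        'name': ['john', 'mary'] * (n // 2),
        'age': [float('nan'), 25] * (n // 2),
    })


def with_iterrows(df):
    # Row-by-row loop, writing back through df.at
    df = df.copy()
    for index, row in df.iterrows():
        if row['name'] == 'john':
            df.at[index, 'age'] = 27
    return df


def with_apply(df):
    # Row-wise lambda via apply(axis=1)
    df = df.copy()
    df['age'] = df.apply(lambda x: 27 if x['name'] == 'john' else x['age'], axis=1)
    return df


def with_loc(df):
    # Boolean mask assignment via .loc
    df = df.copy()
    df.loc[df['name'] == 'john', 'age'] = 27
    return df


frame = make_frame()
for func in (with_iterrows, with_apply, with_loc):
    seconds = timeit.timeit(lambda: func(frame), number=10)
    print(f"{func.__name__}: {seconds:.4f} s for 10 runs")
```

Each helper works on a copy, so the same input frame is reused for every run and the three timings are comparable.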

Ensure missing values are represented as null values ( np.nan ). The second set of information should be stored in another DataFrame with the same column labels.

Then, once the index is set to 'Employee ID', update will align on the indices and fill in the missing values.

Sample Data

import pandas as pd
import numpy as np

df = pd.DataFrame({'Employee ID': ["11", "11", "14", "21"],
                   'Name': ['Alan', 'Alan', 'Brian', 'Dennis'],
                   'Age': [14, 14, np.nan, np.nan],
                   'Employment Duration': [3, 3, np.nan, np.nan],
                   'Clients Handled': ['A', 'B', 'C', 'G']})
data = {"14": {"Age": 31, "Employment Duration": 3},
        "21": {"Age": 45, "Employment Duration": 12}}

df2 = pd.DataFrame.from_dict(data, orient='index')

Code

# df = df.replace('', np.nan)  # Uncomment if missing values are stored as empty strings
df = df.set_index('Employee ID')

df.update(df2, overwrite=False)
print(df)

               Name   Age  Employment Duration Clients Handled
Employee ID                                                    
11             Alan  14.0                  3.0                A
11             Alan  14.0                  3.0                B
14            Brian  31.0                  3.0                C
21           Dennis  45.0                 12.0                G
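If you would rather keep 'Employee ID' as an ordinary column, a different technique is to combine Series.map with fillna. This is not the update approach shown above, just a sketch of an alternative on the same sample data; the name lookup is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Employee ID': ["11", "11", "14", "21"],
                   'Name': ['Alan', 'Alan', 'Brian', 'Dennis'],
                   'Age': [14, 14, np.nan, np.nan],
                   'Employment Duration': [3, 3, np.nan, np.nan],
                   'Clients Handled': ['A', 'B', 'C', 'G']})
data = {"14": {"Age": 31, "Employment Duration": 3},
        "21": {"Age": 45, "Employment Duration": 12}}

# One row per Employee ID, one column per field to fill
lookup = pd.DataFrame.from_dict(data, orient='index')

for column in lookup.columns:
    # map() looks up each row's Employee ID in the lookup column and
    # yields NaN for IDs that are absent; fillna() only touches rows
    # whose current value is missing, so existing values are kept
    df[column] = df[column].fillna(df['Employee ID'].map(lookup[column]))

print(df)
```

As with update(..., overwrite=False), values that are already present are left alone; only the NaN cells for Brian and Dennis are filled.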
