
How do I iterate over a pandas dataframe to impute missing values that are present in another data frame?

I am trying to impute the NaNs in a column with the values present in the same column, but I cannot figure out how to map them using another column.

I have two pandas DataFrames. The first one (df) has all the values and looks like this:

|Sr. No| Fares|   Route    |
|------|------|------------|
|  1   | 123  |ABE-PGD-ABE |
|  2   | 456  |ABQ-SLC-ABQ |
|  3   | 789  |ALB-SJU-ALB |

The second DataFrame (df_1) looks like this:

|Sr. No| Fares|   Route    |
|------|------|------------|
|  130 | NaN  |ABE-PGD-ABE |
|  297 | NaN  |ABQ-SLC-ABQ |
|  345 | NaN  |ALB-SJU-ALB |

Now I want to impute the NaNs in the Fares column for all the Routes that match. The second DataFrame is just a subset of the first one, because I wanted to isolate all the rows with NaN in the Fares column.

Here is my code:

for i in df_1: 
     df[Fare] = df[Fare].map({'Nan': ''})

Please let me know what I am doing wrong. I don't know what to map it to, so I have left the value for 'Nan' blank.

You have a few things going on here.

Firstly, when you iterate over a DataFrame like for i in df, you are actually iterating over the columns (i.e. Series), not the rows as you might expect. You can get a row iterator with df.iterrows(), which looks like this:

for row_index, row in df.iterrows():
    # row is a pd.Series, which is like a vector / array / tuple
    print(row_index, row["Route"])

Within the loop you need to pull out the Route, then use that Route to look up the fare in the other DataFrame:

for row_index, row in df.iterrows():
    route = row["Route"]

    # find rows in the other DataFrame that match this Route
    other_rows = df_other[df_other["Route"] == route]

    # if there isn't exactly one matching row, skip it
    if len(other_rows) != 1:
        continue

    # .loc is how we set a single value in the DataFrame
    df.loc[row_index, "Fares"] = other_rows.iloc[0]["Fares"]

Having said all this, we wouldn't normally treat a DataFrame as a list of rows to iterate over. Think of it as a database table and prefer set-based operations.

Here's how I would do this:

# index both frames on Route so that corresponding rows line up
df = df.set_index("Route")
df_other = df_other.set_index("Route")

# combine_first performs an "if null, take the other value" kind of coalescing
combined = df.combine_first(df_other)

# the shared index ensures that we are updating the right rows
df["Fares"] = combined["Fares"]
