How to Fill NaNs in Column of Main Dataframe Based On Conditions Matching Secondary Dataframe of Values to Fill NaNs With Multiple Filler Values

I need to fill NA values in my main dataframe based on a second dataframe that I created with the groupby and mean functions. My original dataframe has about 1.5K NaNs to fill, so the approach needs to work at scale. I've created a fake dataframe that is a quick-and-dirty imitation of my data using a made-up scenario, because I can't share my real data.

My general idea is:

main_data[
          (main_data["Animal_Type"] == mean_data["Animal_Type"]) & 
          (main_data["Cost_Type"] == mean_data["Cost_Type"])
         ] = main_data["Price"].fillna(mean_data["Price"])

Obviously that doesn't work, but it's the general gist of my logic. I found [this answer][1] but I can't seem to apply it properly to my problem. A lot of answers involve mask, or assume my data is pretty small with a single value to replace all my NaNs with. I have about 50 different means in my original dataset, each uniquely paired with an "Animal Type" per "Cost Type". My original dataframe is about 30K observations long, full of unique observations too. I can use map, but that only works with a single column. I'm fairly new to coding, so a lot of the other answers were too complicated for me to understand and adapt.

main_data

main_data.head(10)

    Pet_ID  Animal_Type Cost_Type   Price
0   101     Goat        Housing     6.0
1   102     Dog         Housing     6.0
2   103     Horse       Housing     NaN
3   104     Horse       Housing     5.0
4   105     Goat        Housing     3.0
5   106     Dog         Feeding     3.0
6   107     Cat         Feeding     6.0
7   108     Horse       Housing     6.0
8   109     Hamster     Feeding     5.0
9   110     Horse       Feeding     3.0

mean_data

    Animal_Type Cost_Type   Price
0   Cat         Feeding     4.500000
1   Cat         Housing     5.000000
2   Chicken     Feeding     5.000000
3   Chicken     Housing     4.500000
4   Dog         Feeding     3.000000
5   Dog         Housing     6.000000
6   Goat        Feeding     5.000000
7   Goat        Housing     5.000000
8   Hamster     Feeding     5.250000
9   Hamster     Housing     3.000000
10  Horse       Feeding     3.500000
11  Horse       Housing     5.666667
12  Rabit       Feeding     3.000000
13  Rabit       Housing     3.000000

My reproducible code:

import random

import numpy as np
import pandas as pd

random.seed(10)

main_data = pd.DataFrame(columns = ["Pet_ID", "Animal_Type", "Cost_Type", "Price", "Cost"])

main_data["Pet_ID"] = pd.Series(list(range(101,150)))
main_data["Animal_Type"] = main_data.Animal_Type.apply(lambda x: random.choice(["Dog", "Cat", "Rabit", "Horse", "Goat", "Chicken", "Hamster"])) 
main_data["Cost_Type"] = main_data.Animal_Type.apply(lambda x: random.choice(["Housing", "Feeding"])) 
main_data["Price"] = main_data.Price.apply(lambda x: random.choice([3, 5, 6, np.nan])) 
main_data["Cost"] =  main_data.Cost.apply(lambda x: random.choice([2, 1, 3, np.nan])) 

mean_data = main_data.groupby(["Animal_Type", "Cost_Type"])["Price"].mean().reset_index()
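One thing worth keeping in mind about a lookup table built this way: as far as I understand pandas, mean() skips NaNs inside each group, so a group whose prices are all missing ends up with a NaN mean and cannot be used as a filler. A minimal sketch on a made-up three-row frame (the `demo` and `means` names are only for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative data: one group with a usable price, one group with none.
demo = pd.DataFrame({
    "Animal_Type": ["Cat", "Cat", "Dog"],
    "Cost_Type": ["Feeding", "Feeding", "Housing"],
    "Price": [3.0, np.nan, np.nan],
})

# Same groupby/mean pattern as above, applied to the tiny frame.
means = demo.groupby(["Animal_Type", "Cost_Type"])["Price"].mean().reset_index()

# Cat/Feeding averages its single known price; Dog/Housing has nothing to average.
print(means)
```

Any NaNs left after filling would then point to groups that had no known price at all.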

Edit: I have put together a solution, but I wouldn't say it's the most elegant or dependable. It's probably not the most efficient either.

main_data = pd.merge(
    main_data,
    mean_data,
    on = ["Animal_Type", "Cost_Type"],
    how = "left"
)

main_data["Price_z"] = main_data["Price_x"].fillna(main_data["Price_y"])
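A possible tidier variant of the same merge idea, sketched on a small hand-made frame: renaming the mean column before merging avoids the Price_x / Price_y suffixes, and the helper column can be dropped afterwards. The `keys`, `merged`, and `Price_mean` names are assumptions made for this example, not part of the original code.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for main_data (values are illustrative only).
main_data = pd.DataFrame({
    "Pet_ID": [1, 2, 3, 4],
    "Animal_Type": ["Dog", "Dog", "Cat", "Cat"],
    "Cost_Type": ["Feeding", "Feeding", "Housing", "Housing"],
    "Price": [3.0, np.nan, 5.0, np.nan],
})

keys = ["Animal_Type", "Cost_Type"]

# Group means, with the value column renamed up front so the merge
# does not create Price_x / Price_y.
mean_data = (
    main_data.groupby(keys)["Price"].mean()
    .rename("Price_mean")
    .reset_index()
)

# A left merge keeps every row of main_data, in its original order.
merged = main_data.merge(mean_data, on=keys, how="left")

# Fill only the missing prices, then drop the helper column.
merged["Price"] = merged["Price"].fillna(merged["Price_mean"])
merged = merged.drop(columns="Price_mean")
```

The result has the same columns as the input, with each missing Price replaced by the mean of its Animal_Type / Cost_Type pair.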

Edit 2: I've added a "Cost" column with NaNs. The fix for Price should leave this column untouched, but I would like to be able to apply the same methodology to it.

[1]: Replace values based on multiple conditions with groupby mean in Pandas

I need to fill NA values in my main data frame based on a second dataframe I created by the groupby and mean functions.

You don't need that step. You can do this in one step by splitting the data into one dataframe per group, taking the mean within each group, and filling NA values within just that group.

So, instead of creating the mean_data dataframe, do this:

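def fill_by_mean(df):
    # df holds the rows of one (Animal_Type, Cost_Type) group.
    # Fill that group's missing prices with the group's own mean.
    df["Price"] = df["Price"].fillna(df["Price"].mean())
    return df

# group_keys=False keeps the original row index on the result.
main_data = main_data.groupby(["Animal_Type", "Cost_Type"], group_keys=False).apply(fill_by_mean)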

Each individual call to fill_by_mean() sees a dataframe which looks like this:

    Pet_ID Animal_Type Cost_Type  Price
11     112       Rabit   Feeding    NaN
34     135       Rabit   Feeding    3.0
38     139       Rabit   Feeding    3.0

It then takes the mean of the Price column and uses it to fill the NA values. groupby then concatenates all of the individual dataframes back together.
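The same per-group fill can probably also be written without a custom function, using groupby(...).transform("mean"), which returns one group mean per original row. The sketch below is a swapped-in alternative, not the method described above, and it assumes the Cost column from the question's second edit should be filled the same way. The frame contents and the `keys`, `cols`, `group_means`, and `filled` names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with gaps in both Price and Cost.
main_data = pd.DataFrame({
    "Pet_ID": [101, 102, 103, 104, 105],
    "Animal_Type": ["Rabit", "Rabit", "Rabit", "Horse", "Horse"],
    "Cost_Type": ["Feeding", "Feeding", "Feeding", "Housing", "Housing"],
    "Price": [np.nan, 3.0, 3.0, 5.0, np.nan],
    "Cost": [2.0, np.nan, 1.0, np.nan, np.nan],
})

keys = ["Animal_Type", "Cost_Type"]
cols = ["Price", "Cost"]  # list only "Price" here to leave Cost untouched

# One row per original row, holding the mean of that row's group.
group_means = main_data.groupby(keys)[cols].transform("mean")

# Existing values are kept; only NaNs are replaced by the group mean.
filled = main_data.copy()
filled[cols] = filled[cols].fillna(group_means)
```

In this toy data the Horse / Housing group has no known Cost at all, so its Cost values stay NaN after the fill.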
