How to Fill NaNs in Column of Main Dataframe Based On Conditions Matching Secondary Dataframe of Values to Fill NaNs With Multiple Filler Values
I need to fill NA values in my main data frame based on a second dataframe I created with the `groupby` and `mean` functions. My original dataframe has about 1.5K NaNs I need to fill, so this needs to be reproducible at scale. I've created a fake dataframe that's a quick and dirty imitation of my data using a fake scenario. I can't share my real data.
My general idea is:
main_data[
(main_data["Animal_Type"] == mean_data["Animal_Type"]) &
(main_data["Cost_Type"] == mean_data["Cost_Type"])
] = main_data["Price"].fillna(mean_data["Price"])
Obviously, that doesn't work, but that's the general gist of my logic. I found [this answer][1], but I can't seem to apply it properly to my problem. A lot of answers involve `mask` or assume my data is pretty small, with a single value to replace all my NaNs with. I have about 50 different means in my original dataset, each uniquely paired with an "Animal Type" for each "Cost Type". My original data frame is about 30K observations long and full of unique observations too. I can map, but that only works for a single column. I'm fairly new to coding, so a lot of the other answers were too complicated for me to understand and alter.
main_data
main_data.head(10)
**Pet_ID Animal_Type Cost_Type Price**
0 101 Goat Housing 6.0
1 102 Dog Housing 6.0
2 103 Horse Housing NaN
3 104 Horse Housing 5.0
4 105 Goat Housing 3.0
5 106 Dog Feeding 3.0
6 107 Cat Feeding 6.0
7 108 Horse Housing 6.0
8 109 Hamster Feeding 5.0
9 110 Horse Feeding 3.0
mean_data
Animal_Type Cost_Type Price
0 Cat Feeding 4.500000
1 Cat Housing 5.000000
2 Chicken Feeding 5.000000
3 Chicken Housing 4.500000
4 Dog Feeding 3.000000
5 Dog Housing 6.000000
6 Goat Feeding 5.000000
7 Goat Housing 5.000000
8 Hamster Feeding 5.250000
9 Hamster Housing 3.000000
10 Horse Feeding 3.500000
11 Horse Housing 5.666667
12 Rabit Feeding 3.000000
13 Rabit Housing 3.000000
My reproducible code:
import random
import numpy as np
import pandas as pd

random.seed(10)
main_data = pd.DataFrame(columns = ["Pet_ID", "Animal_Type", "Cost_Type", "Price", "Cost"])
main_data["Pet_ID"] = pd.Series(list(range(101,150)))
main_data["Animal_Type"] = main_data.Animal_Type.apply(lambda x: random.choice(["Dog", "Cat", "Rabit", "Horse", "Goat", "Chicken", "Hamster"]))
main_data["Cost_Type"] = main_data.Animal_Type.apply(lambda x: random.choice(["Housing", "Feeding"]))
main_data["Price"] = main_data.Price.apply(lambda x: random.choice([3, 5, 6, np.nan]))
main_data["Cost"] = main_data.Cost.apply(lambda x: random.choice([2, 1, 3, np.nan]))
mean_data = main_data.groupby(["Animal_Type", "Cost_Type"])["Price"].mean().reset_index()
Edit: I have put together two solutions, but I wouldn't call them elegant or dependable, and they're probably not the most efficient either.
main_data = pd.merge(
main_data,
mean_data,
on = ["Animal_Type", "Cost_Type"],
how = "left"
)
main_data["Price_z"] = main_data["Price_x"].fillna(main_data["Price_y"])
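For comparison, here is a minimal sketch of the same merge idea that avoids the `Price_x` / `Price_y` suffixes by renaming the mean column before merging. The four-row `main_data` and the names `Price_mean`, `merged` and `filled` are illustrative assumptions, not part of the original data.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for main_data (hypothetical values, not the real data).
main_data = pd.DataFrame({
    "Pet_ID": [101, 102, 103, 104],
    "Animal_Type": ["Horse", "Horse", "Horse", "Dog"],
    "Cost_Type": ["Housing", "Housing", "Housing", "Feeding"],
    "Price": [5.0, np.nan, 6.0, 3.0],
})

# Per-group means, renamed up front so the merge adds a single,
# clearly named helper column instead of Price_x / Price_y.
mean_data = (
    main_data.groupby(["Animal_Type", "Cost_Type"])["Price"]
    .mean()
    .reset_index()
    .rename(columns={"Price": "Price_mean"})
)

# A left merge keeps every row of main_data in its original order.
merged = main_data.merge(mean_data, on=["Animal_Type", "Cost_Type"], how="left")

# Fill only the missing prices, then drop the helper column.
merged["Price"] = merged["Price"].fillna(merged["Price_mean"])
filled = merged.drop(columns="Price_mean")
```

The result has the same columns as the input, and only the rows that were NaN change.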
Edit 2: I've added a "Cost" column with NaNs. I don't want this column touched, but I would like to be able to use the same methodology on it that we're using for the Price column.
[1]: Replace values based on multiple conditions with groupby mean in Pandas
"I need to fill NA values in my main data frame based on a second dataframe I created by the `groupby` and `mean` functions."
You don't need that step. You can do this in one step by grouping the data into multiple dataframes, taking the mean within each individual dataframe, and filling the NA values within just that dataframe.
So, instead of creating the `mean_data` dataframe, do this:
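def fill_by_mean(df):
    # df holds the rows of a single (Animal_Type, Cost_Type) group
    df = df.copy()
    df["Price"] = df["Price"].fillna(df["Price"].mean())
    return df

# group_keys=False keeps the original row index instead of adding the group keys to it
main_data = main_data.groupby(["Animal_Type", "Cost_Type"], group_keys=False).apply(fill_by_mean)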
Each individual call to fill_by_mean() sees a dataframe which looks like this:
Pet_ID Animal_Type Cost_Type Price
11 112 Rabit Feeding NaN
34 135 Rabit Feeding 3.0
38 139 Rabit Feeding 3.0
It then takes the mean of the Price column and uses it to fill the NA values in that group. `groupby` then concatenates all of the individual dataframes back together.
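As an alternative sketch of the same per-group idea, `groupby(...).transform("mean")` returns one group mean per row, aligned with the original index, so it can be passed straight to `fillna` without a helper function. The five rows below reuse the Rabit/Feeding and Horse/Housing examples from this page. The `Cost` values and the names `keys` and `group_means` are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in for main_data; Cost values are hypothetical.
main_data = pd.DataFrame({
    "Pet_ID": [112, 135, 139, 103, 104],
    "Animal_Type": ["Rabit", "Rabit", "Rabit", "Horse", "Horse"],
    "Cost_Type": ["Feeding", "Feeding", "Feeding", "Housing", "Housing"],
    "Price": [np.nan, 3.0, 3.0, np.nan, 5.0],
    "Cost": [2.0, np.nan, 1.0, 3.0, np.nan],
})

keys = ["Animal_Type", "Cost_Type"]

# One mean per row, computed within that row's group (NaNs are skipped).
group_means = main_data.groupby(keys)["Price"].transform("mean")

# Only the missing prices are replaced; Cost is left untouched.
main_data["Price"] = main_data["Price"].fillna(group_means)
```

Repeating the last two statements with `"Cost"` in place of `"Price"` would apply the same method to the Cost column when that is wanted.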