How to replace certain values in a pandas column with the mean column value of similar rows?
I currently have a pandas DataFrame containing property information from this Kaggle dataset. Below is a sample DataFrame from that set:
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| Annadale | 5 | 5425 | 2015 | ... |
| Woodside | 4 | 2327 | 1966 | ... |
| Alphabet City | 1 | 396 | 1985 | ... |
| Alphabet City | 1 | 405 | 1996 | ... |
| Alphabet City | 1 | 396 | 1986 | ... |
| Alphabet City | 1 | 396 | 1992 | ... |
| Alphabet City | 1 | 396 | 0 | ... |
| Alphabet City | 1 | 396 | 1990 | ... |
| Alphabet City | 1 | 396 | 1984 | ... |
| Alphabet City | 1 | 396 | 0 | ... |
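What I want to do is take every row where the "year built" value equals zero and replace that value with the mean of the "year built" values in the rows that share the same neighborhood, borough, and block. In some cases, several rows in a {neighborhood, borough, block} set have zero in the "year built" column. This is shown in the sample DataFrame above.
To illustrate the problem, here are those two rows from the sample DataFrame.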
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1 | 396 | 0 | ... |
| Alphabet City | 1 | 396 | 0 | ... |
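To solve this, I want to fill the rows where "year built" is zero with the mean of the "year built" values from all other rows with the same neighborhood, borough, and block. For the example rows, the neighborhood is Alphabet City, the borough is 1, and the block is 396, so I would use the following matching rows from the sample DataFrame to compute the mean: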
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1 | 396 | 1985 | ... |
| Alphabet City | 1 | 396 | 1986 | ... |
| Alphabet City | 1 | 396 | 1992 | ... |
| Alphabet City | 1 | 396 | 1990 | ... |
| Alphabet City | 1 | 396 | 1984 | ... |
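I would take the mean of the "year built" column over those rows (which is 1987.4) and replace the zeros with it. The rows that originally had zeros would end up looking like this: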
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1 | 396 | 1987.4 | ... |
| Alphabet City | 1 | 396 | 1987.4 | ... |
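As a quick check of the arithmetic above, the five matching years can be averaged by hand. This is only a sanity check on the sample values quoted in the question; the variable names are illustrative.

```python
# Non-zero "year built" values for Alphabet City, borough 1, block 396
years = [1985, 1986, 1992, 1990, 1984]

# Sum is 9937 over 5 rows
mean_year = sum(years) / len(years)
print(mean_year)  # 1987.4
```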
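What I have done so far is drop the rows with zero in the "year built" column and compute the mean year for each {neighborhood, borough, block} set. The original DataFrame is stored in raw_data, and it looks like the sample DataFrame at the very top of this post. The code looks like this: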
# create a copy of the data
temp_data = raw_data.copy()
# remove all rows with zero in the "year built" column
mean_year_by_location = temp_data[temp_data["YEAR BUILT"] > 0]
# group the rows into {neighborhood, borough, block} sets and take the mean of the "year built" column in those sets
mean_year_by_location = mean_year_by_location.groupby(["NEIGHBORHOOD","BOROUGH","BLOCK"], as_index = False)["YEAR BUILT"].mean()
The output looks like this:
| neighborhood | borough | block | year built |
------------------------------------------------
| .... | ... | ... | ... |
| Alphabet City | 1 | 390 | 1985.342 |
| Alphabet City | 1 | 391 | 1986.76 |
| Alphabet City | 1 | 392 | 1992.8473 |
| Alphabet City | 1 | 393 | 1990.096 |
| Alphabet City | 1 | 394 | 1984.45 |
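So how can I take those mean "year built" values from the mean_year_by_location DataFrame and use them to replace the zeros in the original raw_data DataFrame?
I apologize for the long post. I just wanted to be very clear.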
Use set_index + replace, then fillna on the mean.
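import numpy as np

v = df.set_index(
    ['neighborhood', 'borough', 'block']
)['year built'].replace(0, np.nan)
# Series.mean(level=...) was removed in pandas 2.0; grouping on the index levels is equivalent
df = v.fillna(v.groupby(level=[0, 1, 2], sort=False).mean()).reset_index()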
df
neighborhood borough block year built
0 Annadale 5 5425 2015.0
1 Woodside 4 2327 1966.0
2 Alphabet City 1 396 1985.0
3 Alphabet City 1 405 1996.0
4 Alphabet City 1 396 1986.0
5 Alphabet City 1 396 1992.0
6 Alphabet City 1 396 1987.4
7 Alphabet City 1 396 1990.0
8 Alphabet City 1 396 1984.0
9 Alphabet City 1 396 1987.4
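To reproduce the result above, the sample frame from the question can be rebuilt by hand. The sketch below assumes only the four columns shown (the question's "..." columns are omitted) and a reasonably recent pandas, so it uses groupby(level=...) for the per-group mean; the names keys, group_means and filled are illustrative.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question's table (other columns omitted)
df = pd.DataFrame({
    "neighborhood": ["Annadale", "Woodside"] + ["Alphabet City"] * 8,
    "borough": [5, 4, 1, 1, 1, 1, 1, 1, 1, 1],
    "block": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "year built": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
})

keys = ["neighborhood", "borough", "block"]

# Move the grouping columns into the index and turn zeros into NaN
v = df.set_index(keys)["year built"].replace(0, np.nan)

# Mean per (neighborhood, borough, block); NaN is skipped by default
group_means = v.groupby(level=[0, 1, 2], sort=False).mean()

# fillna aligns group_means on the index, then the index is restored as columns
filled = v.fillna(group_means).reset_index()
print(filled)
```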
Details
First, set the index and replace 0 with NaN so that the upcoming mean calculation is not affected by those values:
v = df.set_index(
['neighborhood', 'borough', 'block']
)['year built'].replace(0, np.nan)
v
neighborhood borough block
Annadale 5 5425 2015.0
Woodside 4 2327 1966.0
Alphabet City 1 396 1985.0
405 1996.0
396 1986.0
396 1992.0
396 NaN
396 1990.0
396 1984.0
396 NaN
Name: year built, dtype: float64
Next, compute the mean:
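# Series.mean(level=...) was removed in pandas 2.0; this groupby is the equivalent
m = v.groupby(level=[0, 1, 2], sort=False).mean()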
m
neighborhood borough block
Annadale 5 5425 2015.0
Woodside 4 2327 1966.0
Alphabet City 1 396 1987.4
405 1996.0
Name: year built, dtype: float64
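This serves as a mapping that we pass to fillna. fillna aligns it on the index and replaces each NaN introduced earlier with the mean for the matching {neighborhood, borough, block} key. Once that is done, just reset the index to restore the original structure.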
v.fillna(m).reset_index()
neighborhood borough block year built
0 Annadale 5 5425 2015.0
1 Woodside 4 2327 1966.0
2 Alphabet City 1 396 1985.0
3 Alphabet City 1 405 1996.0
4 Alphabet City 1 396 1986.0
5 Alphabet City 1 396 1992.0
6 Alphabet City 1 396 1987.4
7 Alphabet City 1 396 1990.0
8 Alphabet City 1 396 1984.0
9 Alphabet City 1 396 1987.4
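A related variant, not part of the answer above, skips the set_index and reset_index round trip by using groupby(...).transform("mean"), which returns one value per original row. The sketch below is written under the assumption that the frame has the four sample columns from the question; years, row_means and df_filled are illustrative names.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question's table (other columns omitted)
df = pd.DataFrame({
    "neighborhood": ["Annadale", "Woodside"] + ["Alphabet City"] * 8,
    "borough": [5, 4, 1, 1, 1, 1, 1, 1, 1, 1],
    "block": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "year built": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
})

# Treat zeros as missing so they do not drag the group mean down
years = df["year built"].replace(0, np.nan)

# transform("mean") broadcasts each group's mean back onto its rows
row_means = years.groupby(
    [df["neighborhood"], df["borough"], df["block"]]
).transform("mean")

# Rows whose whole group is zero stay NaN, because their group mean is NaN
df_filled = df.assign(**{"year built": years.fillna(row_means)})
print(df_filled)
```

Because the result keeps the original row index, it can be assigned straight back to the column without any realignment step.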
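I would use mask inside a groupby.apply. I do this only because I like the way it flows. I make no claim that it is particularly fast. Still, this answer may offer some perspective on possible alternatives.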
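gidx = ['neighborhood', 'borough', 'block']

def fill_with_mask(s):
    # mean of the non-zero years in this group
    mean = s.loc[lambda x: x != 0].mean()
    # replace the zeros with that mean
    return s.mask(s.eq(0), mean)

# group_keys=False keeps the original row index on pandas >= 2.0
df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask)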
0 2015.0
1 1966.0
2 1985.0
3 1996.0
4 1986.0
5 1992.0
6 1987.4
7 1990.0
8 1984.0
9 1987.4
Name: year built, dtype: float64
We can then use pd.DataFrame.assign to create a copy of the data with the updated column:
df.assign(**{'year built': df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask)})
neighborhood borough block year built
0 Annadale 5 5425 2015.0
1 Woodside 4 2327 1966.0
2 Alphabet City 1 396 1985.0
3 Alphabet City 1 405 1996.0
4 Alphabet City 1 396 1986.0
5 Alphabet City 1 396 1992.0
6 Alphabet City 1 396 1987.4
7 Alphabet City 1 396 1990.0
8 Alphabet City 1 396 1984.0
9 Alphabet City 1 396 1987.4
The same task can be done in place with column assignment:
df['year built'] = df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask)
Or
df.update(df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask))
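The question specifically asks how to reuse the mean_year_by_location frame it already computed. One possible sketch, not taken from either answer, is a left merge on the three key columns followed by mask. It assumes the upper-case column names from the asker's code and a raw_data frame shaped like the sample table; the "MEAN YEAR" column name and the merged and result variables are hypothetical.

```python
import pandas as pd

keys = ["NEIGHBORHOOD", "BOROUGH", "BLOCK"]

# Stand-in for the asker's raw_data, using the sample rows from the question
raw_data = pd.DataFrame({
    "NEIGHBORHOOD": ["Annadale", "Woodside"] + ["Alphabet City"] * 8,
    "BOROUGH": [5, 4, 1, 1, 1, 1, 1, 1, 1, 1],
    "BLOCK": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "YEAR BUILT": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
})

# Same computation as in the question, with the mean column renamed
# ("MEAN YEAR" is a hypothetical name chosen to avoid a clash in the merge)
mean_year_by_location = (
    raw_data[raw_data["YEAR BUILT"] > 0]
    .groupby(keys, as_index=False)["YEAR BUILT"]
    .mean()
    .rename(columns={"YEAR BUILT": "MEAN YEAR"})
)

# A left merge keeps every row of raw_data in its original order,
# but note that it returns a fresh 0..n-1 index
merged = raw_data.merge(mean_year_by_location, on=keys, how="left")

# Swap in the group mean only where the year is zero
is_zero = merged["YEAR BUILT"].eq(0)
result = merged.assign(
    **{"YEAR BUILT": merged["YEAR BUILT"].mask(is_zero, merged["MEAN YEAR"])}
).drop(columns="MEAN YEAR")
print(result)
```

If raw_data does not have a default integer index, the merge discards it, so the groupby-based approaches above may be the safer choice in that case.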