How do I replace certain values in a pandas column with the mean column value of similar rows?

Question

The problem

I currently have a pandas DataFrame containing property information from this Kaggle dataset. Here is a sample DataFrame from that set:

| neighborhood  | borough | block | year built | ... |
------------------------------------------------------
| Annadale      | 5       | 5425  | 2015       | ... |
| Woodside      | 4       | 2327  | 1966       | ... |
| Alphabet City | 1       | 396   | 1985       | ... |
| Alphabet City | 1       | 405   | 1996       | ... |
| Alphabet City | 1       | 396   | 1986       | ... |
| Alphabet City | 1       | 396   | 1992       | ... |
| Alphabet City | 1       | 396   | 0          | ... |
| Alphabet City | 1       | 396   | 1990       | ... |
| Alphabet City | 1       | 396   | 1984       | ... |
| Alphabet City | 1       | 396   | 0          | ... |
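To experiment with this, the sample table can be rebuilt as a small DataFrame. This is only a sketch: it uses the lowercase column names shown in the table (the real dataset appears to use uppercase names such as "YEAR BUILT") and leaves out the columns hidden behind "...".

```python
import pandas as pd

# Sample rows copied from the table above; the "..." columns are omitted.
df = pd.DataFrame({
    "neighborhood": ["Annadale", "Woodside"] + ["Alphabet City"] * 8,
    "borough": [5, 4, 1, 1, 1, 1, 1, 1, 1, 1],
    "block": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "year built": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
})

# Rows whose "year built" is the 0 placeholder
zero_rows = df[df["year built"] == 0]
print(zero_rows)
```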

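What I want to do is take each row where the value in the "year built" column is zero, and replace the "year built" value in those rows with the mean of the "year built" values in the rows that share the same neighborhood, borough, and block. In some cases a {neighborhood, borough, block} set contains several rows with zero in the "year built" column. The sample DataFrame above shows this.

To illustrate the problem, I pulled those two rows out of the sample DataFrame.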

| neighborhood  | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1       | 396   | 0          | ... |
| Alphabet City | 1       | 396   | 0          | ... |

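To fix this, I want to fill the rows where "year built" is zero with the mean of the "year built" values from all other rows that have the same neighborhood, borough, and block. For the sample rows, the neighborhood is Alphabet City, the borough is 1, and the block is 396, so I would use the following matching rows from the sample DataFrame to compute the mean: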

| neighborhood  | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1       | 396   | 1985       | ... |
| Alphabet City | 1       | 396   | 1986       | ... |
| Alphabet City | 1       | 396   | 1992       | ... |
| Alphabet City | 1       | 396   | 1990       | ... |
| Alphabet City | 1       | 396   | 1984       | ... |

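I would take the mean of the "year built" column over those rows (which is 1987.4) and replace the zeros with that mean. The rows that originally had zeros would end up looking like this: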

| neighborhood  | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1       | 396   | 1987.4     | ... |
| Alphabet City | 1       | 396   | 1987.4     | ... |

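Code I have so far

All I have done so far is remove the rows with zero in the "year built" column and find the mean year for each {neighborhood, borough, block} set. The original DataFrame is stored in raw_data, and it looks like the sample DataFrame at the very top of this post. The code looks like this: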

# create a copy of the data
temp_data = raw_data.copy()

# remove all rows with zero in the "year built" column
mean_year_by_location = temp_data[temp_data["YEAR BUILT"] > 0]

# group the rows into {neighborhood, borough, block} sets and take the mean of the "year built" column in those sets
mean_year_by_location = mean_year_by_location.groupby(["NEIGHBORHOOD","BOROUGH","BLOCK"], as_index = False)["YEAR BUILT"].mean()

The output looks like this:

| neighborhood  | borough | block | year built | 
------------------------------------------------
| ....          | ...     | ...   | ...        |
| Alphabet City | 1       | 390   | 1985.342   | 
| Alphabet City | 1       | 391   | 1986.76    | 
| Alphabet City | 1       | 392   | 1992.8473  | 
| Alphabet City | 1       | 393   | 1990.096   | 
| Alphabet City | 1       | 394   | 1984.45    |

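So how do I take those averaged "year built" values from the mean_year_by_location DataFrame and use them to replace the zeros in the original raw_data DataFrame?

I apologize for the long post. I just wanted to be very clear.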

Answer 1

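Use set_index + replace, then fillna with the group mean.

import numpy as np

# Index by the grouping keys and turn the 0 placeholders into NaN
v = df.set_index(
    ['neighborhood', 'borough', 'block']
)['year built'].replace(0, np.nan)

# Group on the index levels (Series.mean(level=...) no longer exists in pandas 2.0+)
df = v.fillna(v.groupby(level=[0, 1, 2], sort=False).mean()).reset_index()
df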

    neighborhood  borough  block  year built
0       Annadale        5   5425      2015.0
1       Woodside        4   2327      1966.0
2  Alphabet City        1    396      1985.0
3  Alphabet City        1    405      1996.0
4  Alphabet City        1    396      1986.0
5  Alphabet City        1    396      1992.0
6  Alphabet City        1    396      1987.4
7  Alphabet City        1    396      1990.0
8  Alphabet City        1    396      1984.0
9  Alphabet City        1    396      1987.4

Details

First, set the index and replace 0 with NaN, so that the upcoming mean calculation is not affected by those values:

v = df.set_index(
    ['neighborhood', 'borough', 'block']
)['year built'].replace(0, np.nan)   

v 

neighborhood   borough  block
Annadale       5        5425     2015.0
Woodside       4        2327     1966.0
Alphabet City  1        396      1985.0
                        405      1996.0
                        396      1986.0
                        396      1992.0
                        396         NaN
                        396      1990.0
                        396      1984.0
                        396         NaN
Name: year built, dtype: float64

Next, compute the mean for each index combination:

m = v.groupby(level=[0, 1, 2], sort=False).mean()
m

neighborhood   borough  block
Annadale       5        5425     2015.0
Woodside       4        2327     1966.0
Alphabet City  1        396      1987.4
                        405      1996.0
Name: year built, dtype: float64

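This serves as a mapping that we pass to fillna. fillna replaces each NaN introduced earlier with the mean that the mapping holds for that row's index. Once that is done, just reset the index to restore the original structure.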

v.fillna(m).reset_index()

    neighborhood  borough  block  year built
0       Annadale        5   5425      2015.0
1       Woodside        4   2327      1966.0
2  Alphabet City        1    396      1985.0
3  Alphabet City        1    405      1996.0
4  Alphabet City        1    396      1986.0
5  Alphabet City        1    396      1992.0
6  Alphabet City        1    396      1987.4
7  Alphabet City        1    396      1990.0
8  Alphabet City        1    396      1984.0
9  Alphabet City        1    396      1987.4
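If you would rather reuse the mean_year_by_location table built in the question, a left merge is one possible way to bring the means back onto the original rows. The sketch below is an illustration under a few assumptions: it uses the lowercase column names from the sample table, rebuilds the sample data inline, and introduces a hypothetical helper column called "mean year".

```python
import pandas as pd

keys = ["neighborhood", "borough", "block"]

# Sample data from the question (lowercase column names assumed)
raw_data = pd.DataFrame({
    "neighborhood": ["Annadale", "Woodside"] + ["Alphabet City"] * 8,
    "borough": [5, 4, 1, 1, 1, 1, 1, 1, 1, 1],
    "block": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "year built": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
})

# Same idea as in the question: mean per location, ignoring the zeros.
# "mean year" is a made-up column name used only for this example.
nonzero = raw_data[raw_data["year built"] > 0]
mean_year_by_location = (
    nonzero.groupby(keys, as_index=False)["year built"].mean()
    .rename(columns={"year built": "mean year"})
)

# A left merge keeps every original row, in the original order
merged = raw_data.merge(mean_year_by_location, on=keys, how="left")

# Use the group mean where the year is 0, keep the original value elsewhere
filled = raw_data.copy()
filled["year built"] = merged["year built"].mask(
    merged["year built"].eq(0), merged["mean year"]
).to_numpy()
print(filled)
```

A location whose rows are all zero has no entry in mean_year_by_location, so its "year built" values come out as NaN here rather than staying 0.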

Answer 2

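I would use mask inside groupby.apply. I'm doing it this way only because I like how it flows. I make no claim that it is particularly fast. Still, this answer may be useful as a look at the possible alternatives.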

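gidx = ['neighborhood', 'borough', 'block']

def fill_with_mask(s):
    # Mean of the non-zero years in this group
    mean = s.loc[lambda x: x != 0].mean()
    # Swap the zeros for that mean, leave everything else alone
    return s.mask(s.eq(0), mean)

# group_keys=False keeps the original row index on pandas >= 2.0
df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask)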

0    2015.0
1    1966.0
2    1985.0
3    1996.0
4    1986.0
5    1992.0
6    1987.4
7    1990.0
8    1984.0
9    1987.4
Name: year built, dtype: float64

We can then use pd.DataFrame.assign to create a copy of the DataFrame with the filled column:

df.assign(**{'year built': df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask)})

    neighborhood  borough  block  year built
0       Annadale        5   5425      2015.0
1       Woodside        4   2327      1966.0
2  Alphabet City        1    396      1985.0
3  Alphabet City        1    405      1996.0
4  Alphabet City        1    396      1986.0
5  Alphabet City        1    396      1992.0
6  Alphabet City        1    396      1987.4
7  Alphabet City        1    396      1990.0
8  Alphabet City        1    396      1984.0
9  Alphabet City        1    396      1987.4

The same task can be done in place with column assignment:

df['year built'] = df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask)

or

df.update(df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask))
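As a further variant, not taken from either answer, groupby(...).transform("mean") broadcasts each group's mean back onto its rows, which avoids a custom function. This is a minimal sketch that assumes the lowercase column names from the sample table and rebuilds the sample data inline.

```python
import numpy as np
import pandas as pd

gidx = ["neighborhood", "borough", "block"]

# Sample data from the question (lowercase column names assumed)
df = pd.DataFrame({
    "neighborhood": ["Annadale", "Woodside"] + ["Alphabet City"] * 8,
    "borough": [5, 4, 1, 1, 1, 1, 1, 1, 1, 1],
    "block": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "year built": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
})

# Treat 0 as missing so it does not pull the group mean down
years = df["year built"].replace(0, np.nan)

# transform("mean") returns one value per row, aligned with df's index
group_mean = (
    df.assign(**{"year built": years})
      .groupby(gidx)["year built"]
      .transform("mean")
)

# Fill only the missing years; df itself is left unchanged
result = df.assign(**{"year built": years.fillna(group_mean)})
print(result)
```

Because transform returns a result with the same index as the input, no reindexing or group_keys handling is needed before assigning it back.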

2 answers

Answer 1
4, accepted, 2018-01-08 05:51:48

Answer 2
2, 2018-01-08 08:57:44
