Python Pandas: replacing new values in a column with 'Other'

Question

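I have a pandas DataFrame with a factor column that has 30 distinct levels. Some levels occur rarely, so I collapse them into an 'Other' group. The resulting column has 25 distinct levels plus one 'Other' level.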

d = df1['column1'].value_counts() >= 50
df1['column1'] = [i if d[i] else 'Other' for i in df1['column1']]
df1['column1'] = df1['column1'].astype('category')

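I have a second DataFrame that I want to convert so that it has the same levels as the first one (including any new levels that do not appear in the first DataFrame). I tried the code below, but I get a KeyError, and the message does not really explain the problem.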

df2['column1'] = [i if d[i] else 'Other' for i in df2['column1']]
df2['column1'] = df2['column1'].astype('category')

Any idea what is causing this?

Answer 1

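I was able to reproduce your KeyError with your code by injecting values into df2['column1'] that do not exist in df1['column1']. The Series d is indexed only by the levels seen in df1, so d[i] fails for any level that df1 never contained.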
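A minimal sketch of that failure, using hypothetical toy data rather than the asker's real frames. It also shows one possible small change to the original list comprehension, assuming that unseen levels should be treated as infrequent: Series.get with a default of False.

```python
import pandas as pd

# Hypothetical toy data: 'c' appears in df2 but never in df1.
df1 = pd.DataFrame({'column1': ['a', 'a', 'b']})
df2 = pd.DataFrame({'column1': ['a', 'c']})

# Boolean Series indexed by the levels seen in df1 only
# (threshold of 2 chosen just for this toy example).
d = df1['column1'].value_counts() >= 2

try:
    [i if d[i] else 'Other' for i in df2['column1']]
    raised = False
except KeyError:
    # d has no entry for 'c', so the label lookup fails.
    raised = True

# Series.get returns the default for labels missing from the index,
# so unseen levels fall through to 'Other'.
fixed = [i if d.get(i, False) else 'Other' for i in df2['column1']]
print(raised, fixed)
```

This keeps the original approach intact; the isin-based construct later in this answer avoids the per-element lookup altogether.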

You can make the process robust to unseen values as follows. First, some sample data for df1:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'column1': [f'L{x}' for x in np.random.randint(10, size=100)]})

df2 contains additional values:

df2 = pd.DataFrame({'column1': [f'L{x}' for x in np.random.randint(12, size=100)]})

Get the level counts from df1, keep the most frequent levels, and map everything else to 'other':

cat_counts = df1['column1'].value_counts()

df1.assign(column1=np.where(df1['column1'].isin(cat_counts[cat_counts > 10].index), df1['column1'], 'other')).astype({'column1': 'category'})

   column1
0       L4
1       L9
2       L9
3    other
4    other
..     ...
95   other
96   other
97   other
98      L3
99   other

The same construct also works for df2, even though it contains values that do not exist in df1:

df2.assign(column1=np.where(df2['column1'].isin(cat_counts[cat_counts > 10].index), df2['column1'], 'other')).astype({'column1': 'category'})

   column1
0    other
1       L9
2    other
3    other
4    other
..     ...
95   other
96   other
97   other
98      L9
99   other
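Note that calling astype('category') on each frame separately infers the categories from whatever values that frame happens to contain, so the two columns can end up with different category sets. If both columns need exactly the same levels, one option is to build a single pd.CategoricalDtype from df1 and reuse it. The sketch below uses hypothetical toy data, an arbitrary threshold, and a helper named collapse that is not part of the original answer.

```python
import pandas as pd

# Hypothetical toy data: 'd' appears only in df2.
df1 = pd.DataFrame({'column1': ['a'] * 3 + ['b'] * 3 + ['c']})
df2 = pd.DataFrame({'column1': ['a', 'b', 'c', 'd']})

counts = df1['column1'].value_counts()
keep = counts[counts >= 3].index  # arbitrary threshold for this example

# One dtype, built once from df1, reused for both frames.
shared = pd.CategoricalDtype(categories=list(keep) + ['other'])

def collapse(col):
    # Replace anything outside `keep` with 'other', then apply the shared dtype.
    return col.where(col.isin(keep), 'other').astype(shared)

df1['column1'] = collapse(df1['column1'])
df2['column1'] = collapse(df2['column1'])
print(df2['column1'].tolist())
```

Because every value is mapped into the shared category list before the cast, no value becomes NaN, and the two columns compare equal on dtype.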

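Another option is to select the n most frequent levels. This produces a boolean mask that can replace the isin(...) condition above: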

df1['column1'].isin(cat_counts.nlargest(5).index)
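As a sketch of how that mask could be plugged into the same np.where construct, here is a small example with hypothetical toy data and n=2 chosen arbitrarily:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: counts are a=4, b=3, c=2, d=1.
df1 = pd.DataFrame({'column1': list('aaaabbbccd')})
cat_counts = df1['column1'].value_counts()

# Keep the 2 most frequent levels; n=2 is an arbitrary choice.
top = cat_counts.nlargest(2).index
mask = df1['column1'].isin(top)

# Same construct as above, with the nlargest-based mask as the condition.
result = df1.assign(
    column1=np.where(mask, df1['column1'], 'other')
).astype({'column1': 'category'})
print(result['column1'].tolist())
```

With a count threshold the number of kept levels depends on the data; with nlargest it is fixed at n, which may be preferable when the number of categories has to stay bounded.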
