Multi-column factorize in pandas

Question

The pandas factorize function assigns each unique value in a series to a sequential, 0-based index, and calculates which index each series entry belongs to.

I'd like to accomplish the equivalent of pandas.factorize on multiple columns:

import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
pd.factorize(df)[0] # would like [0, 1, 2, 2, 1, 0]

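That is, I want to determine each unique tuple of values across several columns of a DataFrame, assign a sequential index to each, and compute which index each row of the DataFrame belongs to.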

factorize only works on a single column. Is there a multi-column equivalent in pandas?

Answer 1

You first need to create an ndarray of tuples; pandas.lib.fast_zip can do this very quickly in a Cython loop.

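import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
# pd.lib.fast_zip was an internal helper in the pandas of 2013; later releases no longer expose pd.lib
print(pd.factorize(pd.lib.fast_zip([df.x, df.y]))[0])

The output is: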

[0 1 2 2 1 0]
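
On pandas versions where pd.lib is not available, one alternative that stays within the public API is groupby(...).ngroup(). This is a minimal sketch, not the answerer's method, assuming a pandas release recent enough to provide ngroup and to_numpy, and assuming the codes should follow order of first appearance (hence sort=False):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

# sort=False numbers the groups in order of first appearance,
# mirroring what pd.factorize does for a single column.
codes = df.groupby(['x', 'y'], sort=False).ngroup().to_numpy()
print(codes)
```

With the default sort=True the groups would instead be numbered in sorted key order, which happens to coincide with first-appearance order for this particular DataFrame.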

Answer 2

I'm not sure whether this is an efficient solution. There may be a better one.

arr = []  # holds the unique rows of the DataFrame, in order of first appearance
for row in df.values.tolist():
   if row not in arr:
      arr.append(row)

Printing arr then gives:

>>> print(arr)
[[1, 1], [1, 2], [2, 2]]

To store the indices, I would declare a list ind:

ind = []
for row in df.values.tolist():
   ind.append(arr.index(row))

Printing ind gives:

>>> print(ind)
[0, 1, 2, 2, 1, 0]
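
The loops above rescan arr for every row, so the work grows quadratically with the number of rows. A minimal sketch of a single-pass variant follows, assuming each row can be used as a dictionary key once it is turned into a tuple (the names seen and ind are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

seen = {}  # maps each unique row (as a tuple) to its code
ind = []   # code of every row, in row order
for row in df.itertuples(index=False, name=None):
    # setdefault stores the next free code the first time a row is seen
    # and returns the stored code on every later occurrence
    ind.append(seen.setdefault(row, len(seen)))

print(ind)  # [0, 1, 2, 2, 1, 0]
```

Because dictionaries preserve insertion order, list(seen) yields the unique rows in order of first appearance, which plays the role of arr.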

Answer 3

You can use drop_duplicates to drop the duplicated rows:

In [23]: df.drop_duplicates()
Out[23]: 
      x  y
   0  1  1
   1  1  2
   2  2  2

Edit

To achieve your goal, you can join the original df to the de-duplicated one:

In [46]: df.join(df.drop_duplicates().reset_index().set_index(['x', 'y']), on=['x', 'y'])
Out[46]: 
   x  y  index
0  1  1      0
1  1  2      1
2  2  2      2
3  2  2      2
4  1  2      1
5  1  1      0
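
Note that the index column produced by this join holds the row label of each tuple's first occurrence. Those labels are 0, 1, 2 here only because the first three rows happen to be the unique ones. A minimal sketch of one way to get consecutive codes in general, assuming a left merge is acceptable in place of the join (the sample data and the column name code are illustrative):

```python
import pandas as pd

# Hypothetical data in which the first occurrences are not at rows 0, 1, 2.
df = pd.DataFrame({'x': [1, 1, 1, 2, 2], 'y': [1, 1, 2, 2, 1]})

# One row per unique (x, y) pair, in order of first appearance.
uniques = df.drop_duplicates().reset_index(drop=True)
uniques['code'] = range(len(uniques))  # consecutive codes starting at 0

# A left merge keeps the row order of df and attaches each row's code.
result = df.merge(uniques, on=['x', 'y'], how='left')
print(result['code'].tolist())  # [0, 0, 1, 2, 3]
```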

Answer 4

import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
# Turn each row of the selected columns into a tuple, then factorize the tuples
tuples = df[['x', 'y']].apply(tuple, axis=1)
df['newID'] = pd.factorize(tuples)[0]
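
The same idea can be wrapped in a small helper so that it applies to any list of columns. This is only a sketch; the function name factorize_columns and its parameters are hypothetical and not part of pandas:

```python
import pandas as pd


def factorize_columns(frame, columns):
    """Return one integer code per row, numbered by first appearance
    of each unique combination of values in `columns`."""
    # Build a Series of row tuples, then let pd.factorize number them.
    tuples = frame[columns].apply(tuple, axis=1)
    codes, _ = pd.factorize(tuples)
    return codes


df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
codes = factorize_columns(df, ['x', 'y'])
print(codes)
```

Since apply(tuple, axis=1) builds the tuples row by row in Python, this is likely to be slower on large frames than approaches that stay inside pandas' compiled code.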

4 answers

Answer 1: score 14, accepted, 2013-05-09 08:30:39
Answer 2: score 1, 2013-05-09 04:40:21
Answer 3: score 0, 2013-05-09 02:58:48
Answer 4: score 0, 2017-09-13 19:58:11
