Most efficient way to convert the values of a column in a Pandas DataFrame

Question

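I have a pd.DataFrame that looks like:

I want to apply a cutoff to the values to turn them into binary digits; in this case my cutoff is 0.85. I want the resulting DataFrame to look like:

The script I wrote for this is easy to understand but very inefficient for large datasets. I'm sure Pandas has some way of handling this kind of transformation.

Does anyone know an efficient way to convert a column of floats into a column of integers using a threshold?

My extremely naive way of doing this: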

import numpy as np
import pandas as pd

DF_test = pd.DataFrame(np.array([list("abcde"), list("pqrst"), [0.12, 0.23, 0.93, 0.86, 0.33]]).T, columns=["c1", "c2", "value"])
DF_want = pd.DataFrame(np.array([list("abcde"), list("pqrst"), [0, 0, 1, 1, 0]]).T, columns=["c1", "c2", "value"])


threshold = 0.85

# Empty dataframe to collect rows
DF_naive = pd.DataFrame()
for i in range(DF_test.shape[0]):
    # Get first 2 columns (.iloc replaces .ix, which newer pandas no longer provides)
    first2cols = list(DF_test.iloc[i][:-1])
    # Check if value is greater than threshold
    binary_value = [int(float(DF_test.iloc[i, -1]) > threshold)]
    # Create series object
    SR_row = pd.Series(first2cols + binary_value, name=i)
    # Add the row to the dataframe container (pd.concat replaces the removed DataFrame.append)
    DF_naive = pd.concat([DF_naive, SR_row.to_frame().T])
# Relabel columns
DF_naive.columns = DF_test.columns
DF_naive.head()
# same contents as the sample DF_want

Answer 1

You can use np.where to set the desired values based on a boolean condition:

In [18]:
DF_test['value'] = np.where(DF_test['value'] > threshold, 1,0)
DF_test

Out[18]:
  c1 c2  value
0  a  p      0
1  b  q      0
2  c  r      1
3  d  s      1
4  e  t      0

Note that because your data was built from a heterogeneous np array, the 'value' column contains strings rather than floats:

In [58]:
DF_test.iloc[0]['value']

Out[58]:
'0.12'

So you need to convert the dtype to float first, before applying the comparison: DF_test['value'] = DF_test['value'].astype(float)
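Putting the two steps together, a minimal sketch assuming the same DF_test as in the question. The dict-based construction at the end (DF_typed) is an illustrative alternative that is not part of the original answer; it simply avoids the string column by never going through a mixed np.array.

```python
import numpy as np
import pandas as pd

threshold = 0.85

# Same construction as in the question: the mixed np.array turns every value into a string
DF_test = pd.DataFrame(
    np.array([list("abcde"), list("pqrst"), [0.12, 0.23, 0.93, 0.86, 0.33]]).T,
    columns=["c1", "c2", "value"],
)

# Convert the string column to float first, then apply the threshold
DF_test["value"] = DF_test["value"].astype(float)
DF_test["value"] = np.where(DF_test["value"] > threshold, 1, 0)

# Alternative (assumption, not from the answer): build from a dict so that
# "value" is a float column from the start and no conversion is needed
DF_typed = pd.DataFrame(
    {"c1": list("abcde"), "c2": list("pqrst"), "value": [0.12, 0.23, 0.93, 0.86, 0.33]}
)
DF_typed["value"] = np.where(DF_typed["value"] > threshold, 1, 0)
```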

You can compare the timings:

In [16]:
%timeit np.where(DF_test['value'] > threshold, 1,0)
1000 loops, best of 3: 297 µs per loop

In [17]:
%%timeit
DF_naive = pd.DataFrame()
for i in range(DF_test.shape[0]):
    #Get first 2 columns
    first2cols = list(DF_test.ix[i][:-1])
    #Check if value is greater than threshold
    binary_value = [int((bool(float(DF_test.ix[i][-1]) > threshold)))]
    #Create series object
    SR_row = pd.Series( first2cols + binary_value,name=i)
    #Add to empty dataframe container
    DF_naive = DF_naive.append(SR_row)
10 loops, best of 3: 39.3 ms per loop

The np.where version is more than 100 times faster. Admittedly, your code does a lot of unnecessary work, but you get the idea.
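To reproduce this kind of comparison outside IPython, something like the following sketch could be used. DF_big, its size, and the seed are arbitrary choices for illustration, and the loop here is a much lighter stand-in for the row-by-row code in the question, so the measured ratio will differ from the figures quoted above.

```python
import timeit

import numpy as np
import pandas as pd

threshold = 0.85

# Hypothetical larger test frame; size and seed are arbitrary
rng = np.random.default_rng(0)
DF_big = pd.DataFrame({"value": rng.random(2_000)})


def vectorized(df):
    # One comparison over the whole column, evaluated in compiled code
    return np.where(df["value"] > threshold, 1, 0)


def row_by_row(df):
    # Python-level loop: one comparison per row
    return np.array([int(v > threshold) for v in df["value"]])


# Both approaches produce the same result; only the speed differs
same = np.array_equal(vectorized(DF_big), row_by_row(DF_big))

t_vec = timeit.timeit(lambda: vectorized(DF_big), number=20)
t_loop = timeit.timeit(lambda: row_by_row(DF_big), number=20)
print(f"vectorized: {t_vec:.4f}s, loop: {t_loop:.4f}s (20 runs each)")
```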

Answer 2

Since bool is a subclass of int, i.e. True == 1 and False == 0, you can convert a boolean Series to its integer form:

DF_test['value'] = (DF_test['value'] > threshold).astype(int)

Often, including for most uses in calculations or indexing, the int conversion is not necessary and you may prefer to drop it altogether.
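As a rough illustration of that last point, the boolean Series can be used directly for counting, averaging, and row selection. This sketch assumes a DF_test built from a dict so that "value" is already numeric; the variable names are only for illustration.

```python
import pandas as pd

threshold = 0.85

# Assumption: "value" is already a float column (built from a dict, not a mixed array)
DF_test = pd.DataFrame(
    {"c1": list("abcde"), "c2": list("pqrst"), "value": [0.12, 0.23, 0.93, 0.86, 0.33]}
)

above = DF_test["value"] > threshold  # boolean Series, no int conversion

n_above = above.sum()        # True counts as 1 in arithmetic
share_above = above.mean()   # fraction of rows above the threshold
rows_above = DF_test[above]  # boolean indexing selects the matching rows

# The explicit conversion is only needed when 0/1 integers must be stored
as_int = above.astype(int)
```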

Answer 1
12 accepted 2016-02-25 22:32:52

Answer 2
1 2018-11-18 23:44:34
