Binning a dataframe in pandas in Python

Question

Given the following dataframe in pandas:

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id": np.arange(100)})

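where id is an id for each point consisting of an a and a b value, how can I bin a and b into a specified set of bins (so that I can then take the median/average value of a and b in each bin)? df might have NaN values for a or b (or both) for any given row in df. Thanks.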

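Here is a better example, using Joe Kington's solution with a more realistic df. What I am unsure about is how to access the df.b elements for each df.a group below: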

import numpy as np
import pandas

a = np.random.random(20)
df = pandas.DataFrame({"a": a, "b": a + 10})
# bins for df.a
bins = np.linspace(0, 1, 10)
# bin df according to a
groups = df.groupby(np.digitize(df.a, bins))
# Get the mean of a in each group
print(groups.mean())
## But how to get the mean of b for each group of a?
# ...

Answer 1

There may be a more efficient way (I have a feeling pandas.crosstab would be useful here), but here is how I would do it:

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100),
                       "id": np.arange(100)})

# Bin the data frame by "a" with 10 bins...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(np.digitize(df.a, bins))

# Get the mean of each bin:
print(groups.mean())  # Also could do "groups.aggregate(np.mean)"

# Similarly, the median:
print(groups.median())

# Apply some arbitrary function to aggregate binned data
print(groups.aggregate(lambda x: np.mean(x[x > 0.5])))
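The integer labels in the group index above come from np.digitize. As a minimal sketch of how those labels map to bin edges (the edges and values here are made up for illustration and are not from the original answer):

```python
import numpy as np

# Illustrative edges and values (assumptions, not from the original answer).
edges = np.array([0.0, 0.5, 1.0])
values = np.array([0.05, 0.5, 1.0])

# With the default right=False, label i means edges[i-1] <= x < edges[i].
# A value equal to the last edge gets the label len(edges), so it lands
# in a group of its own.
labels = np.digitize(values, edges)
print(labels)
```

With edges built from np.linspace(df.a.min(), df.a.max(), 10), this means the row holding the maximum of a ends up alone in the last group.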

EDIT: Since the OP was asking specifically for the mean of b binned by the values in a, just do

groups.mean().b

Also, if you want the index to look nicer (e.g. display the intervals as the index), as in @bdiamante's example, use pandas.cut instead of numpy.digitize. (Kudos to bdiamante. I did not realize pandas.cut existed.)

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100), 
                       "b": np.random.random(100) + 10})

# Bin the data frame by "a" with 10 bins...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(pandas.cut(df.a, bins))

# Get the mean of b, binned by the values in a
print(groups.mean().b)

This results in:

a
(0.00186, 0.111]    10.421839
(0.111, 0.22]       10.427540
(0.22, 0.33]        10.538932
(0.33, 0.439]       10.445085
(0.439, 0.548]      10.313612
(0.548, 0.658]      10.319387
(0.658, 0.767]      10.367444
(0.767, 0.876]      10.469655
(0.876, 0.986]      10.571008
Name: b
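The question also mentions NaN values in a or b. The following is a small sketch of how the pandas.cut approach treats them under default pandas behaviour; the frame df_nan and its values are made up for illustration, and observed=False is passed explicitly only to keep every interval in the result:

```python
import numpy as np
import pandas as pd

# Small hand-made frame (illustrative data, not from the original post).
df_nan = pd.DataFrame({
    "a": [0.1, 0.2, np.nan, 0.7, 0.8, 0.9],
    "b": [10.0, np.nan, 12.0, 13.0, np.nan, 15.0],
})

edges = [0.0, 0.5, 1.0]
binned = pd.cut(df_nan["a"], edges)  # a NaN in "a" gets no bin (stays NaN)

# observed=False keeps every interval in the output, even empty ones.
groups = df_nan.groupby(binned, observed=False)

b_mean = groups["b"].mean()  # NaN values of "b" are skipped by mean()
sizes = groups.size()        # the row with NaN in "a" is not counted
print(b_mean)
print(sizes)
```

Rows with a missing a simply drop out of the grouping, while a missing b only reduces the number of values averaged in that bin.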

Answer 2

Not 100% sure if this is what you are looking for, but here is what I think you are getting at:

In [144]: df = DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id":   np.arange(100)})

In [145]: bins = [0, .25, .5, .75, 1]

In [146]: a_bins = df.a.groupby(cut(df.a,bins))

In [147]: b_bins = df.b.groupby(cut(df.b,bins))

In [148]: a_bins.agg([mean,median])
Out[148]:
                 mean    median
a
(0, 0.25]    0.124173  0.114613
(0.25, 0.5]  0.367703  0.358866
(0.5, 0.75]  0.624251  0.626730
(0.75, 1]    0.875395  0.869843

In [149]: b_bins.agg([mean,median])
Out[149]:
                 mean    median
b
(0, 0.25]    0.147936  0.166900
(0.25, 0.5]  0.394918  0.386729
(0.5, 0.75]  0.636111  0.655247
(0.75, 1]    0.851227  0.838805

Of course, I do not know what bins you had in mind, so you will have to swap mine out for your circumstance.
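The session above relies on names such as DataFrame, cut, mean and median already being imported. A self-contained sketch of the same idea follows; the fixed seed, the string aggregator names "mean" and "median", and the explicit observed=False are assumptions added here, not part of the original session:

```python
import numpy as np
import pandas as pd

# Fixed seed is an assumption, used only to make the sketch repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(100),
                   "b": rng.random(100),
                   "id": np.arange(100)})

bins = [0, 0.25, 0.5, 0.75, 1]

# Bin each column by its own values, then aggregate per bin.
a_stats = df["a"].groupby(pd.cut(df["a"], bins), observed=False).agg(["mean", "median"])
b_stats = df["b"].groupby(pd.cut(df["b"], bins), observed=False).agg(["mean", "median"])

print(a_stats)
print(b_stats)
```

Each result has one row per interval and one column per statistic, matching the layout of the Out[148] and Out[149] tables.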

Answer 3

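Joe Kington's answer was very helpful; however, I noticed that it does not bin all of the data. It actually leaves out the row where a equals a.min(), because the intervals produced by pandas.cut are open on the left by default. Summing up groups.size() gave 99 instead of 100.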

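To guarantee that all data is binned, just pass the number of bins to cut(); the function will automatically pad the first [last] bin by 0.1% to ensure all data is included.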

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100) + 10})

# Bin the data frame by "a" with 10 bins...
groups = df.groupby(pandas.cut(df.a, 10))

# Get the mean of b, binned by the values in a
print(groups.mean().b)

In this case, summing up groups.size() gave 100.

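I know this is a picky point for this particular problem, but for a similar problem I was trying to solve, it was crucial to obtain the correct answer.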
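The difference can be checked directly by counting how many rows receive a bin under each approach. This is a sketch with a fixed seed (an assumption for repeatability); it also shows the include_lowest=True option of pandas.cut, which is not mentioned in the answers above but is another way to keep the minimum when passing explicit edges:

```python
import numpy as np
import pandas as pd

# Fixed seed is an assumption, used only to make the sketch repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(100), "b": rng.random(100) + 10})

edges = np.linspace(df["a"].min(), df["a"].max(), 10)

# Explicit edges: intervals are (left, right], so the minimum gets no bin.
n_edges = pd.cut(df["a"], edges).notna().sum()

# Explicit edges, but with the first interval closed on the left.
n_lowest = pd.cut(df["a"], edges, include_lowest=True).notna().sum()

# Integer bin count: pandas widens the range slightly on its own.
n_count = pd.cut(df["a"], 10).notna().sum()

print(n_edges, n_lowest, n_count)
```

Only the first count is short by one row, the one holding the minimum of a.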

Answer 4

If you do not have to stick to pandas grouping, you could use scipy.stats.binned_statistic:

import numpy as np
from scipy.stats import binned_statistic

# df is the data frame from the question; the default statistic is the mean
means = binned_statistic(df.a, df.b, bins=np.linspace(min(df.a), max(df.a), 10))
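Note that binned_statistic returns a result object rather than a plain array of means. A minimal sketch of unpacking it, assuming a NaN-free frame built with a fixed seed (both are assumptions made for this example):

```python
import numpy as np
import pandas as pd
from scipy.stats import binned_statistic

# Fixed seed and NaN-free data are assumptions for this sketch.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(100), "b": rng.random(100) + 10})

edges = np.linspace(df["a"].min(), df["a"].max(), 10)
result = binned_statistic(df["a"].to_numpy(), df["b"].to_numpy(),
                          statistic="mean", bins=edges)

bin_means = result.statistic    # mean of b in each of the 9 bins
bin_edges = result.bin_edges    # the 10 edges that were used
bin_number = result.binnumber   # bin index assigned to each row

print(bin_means)
```

Ten edges define nine bins, so bin_means has nine entries while bin_number has one entry per row of df.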

4 answers

Answer 1
58, accepted, 2013-06-05 20:42:45

Answer 2
24, 2013-06-05 20:42:58

Answer 3
14, 2014-05-16 02:26:51

Answer 4
2, 2017-10-30 10:46:25
