Pandas/NumPy: Fastest way to create a ladder?
I have a pandas DataFrame like this:
color cost temp
0 blue 12.0 80.4
1 red 8.1 81.2
2 pink 24.5 83.5
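For each row I want to create a "ladder" or "range" of costs in 50-cent increments, from $0.50 below the current cost to $0.50 above it. My current code is similar to the following: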
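import numpy
import pandas

incremented_prices = []
df['original_idx'] = df.index  # remember each row's original label
for _, row in df.iterrows():  # iterrows() yields (label, row) pairs
    current_price = row['cost']
    # three prices: $0.50 below, equal to, and $0.50 above the current cost
    more_costs = current_price + numpy.arange(-0.5, 0.51, 0.5)
    for cost in more_costs:
        row_c = row.copy()
        row_c['cost'] = cost
        incremented_prices.append(row_c)
# each element is a Series (one row), so build the frame from the list of rows
df_incremented = pandas.DataFrame(incremented_prices).reset_index(drop=True)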
This code produces a DataFrame like:
color cost temp original_idx
0 blue 11.5 80.4 0
1 blue 12.0 80.4 0
2 blue 12.5 80.4 0
3 red 7.6 81.2 1
4 red 8.1 81.2 1
5 red 8.6 81.2 1
6 pink 24.0 83.5 2
7 pink 24.5 83.5 2
8 pink 25.0 83.5 2
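In the real problem the range will run from -$50.00 to +$50.00, and I find this approach really slow. Is there a faster, vectorized way?
You can try using numpy.repeat to recreate the DataFrame: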
import numpy as np
import pandas as pd

cost_steps = np.arange(-0.5, 0.51, 0.5)
repeats = cost_steps.size
pd.DataFrame(dict(
    color = np.repeat(df.color.values, repeats),
    # vectorized: broadcasting adds every step to every cost in one operation
    cost = (df.cost.values[:, None] + cost_steps).ravel(),
    temp = np.repeat(df.temp.values, repeats),
    original_idx = np.repeat(df.index.values, repeats)
))
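To see what the broadcasting line does, here is a minimal sketch on the sample costs from the question. The names `cost`, `ladder` and `flat` are illustrative only and do not appear in the answer.

```python
import numpy as np

cost = np.array([12.0, 8.1, 24.5])        # the 'cost' column from the question
cost_steps = np.arange(-0.5, 0.51, 0.5)   # offsets: -0.5, 0.0, 0.5

# cost[:, None] has shape (3, 1); adding a (3,) array broadcasts to (3, 3),
# giving one row per original row and one column per step.
ladder = cost[:, None] + cost_steps

# ravel() flattens row by row, so each original row's steps stay together,
# which matches the ordering np.repeat gives the other columns.
flat = ladder.ravel()
```

The flattened array lines up element for element with the repeated `color`, `temp` and `original_idx` arrays, which is why they can be combined into one DataFrame.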
Update, for a DataFrame with more columns:
import numpy as np
import pandas as pd

df1 = df.rename_axis("original_idx").reset_index()
cost_steps = np.arange(-0.5, 0.51, 0.5)
repeats = cost_steps.size
pd.DataFrame(np.hstack((np.repeat(df1.drop(columns="cost").values, repeats, axis=0),
                        (df1.cost.values[:, None] + cost_steps).reshape(-1, 1))),
             columns=df1.columns.drop("cost").tolist() + ["cost"])
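Because this version goes through `.values` and `hstack` on mixed-type data, the resulting columns tend to come back with `object` dtype. One possible alternative that keeps the original dtypes is to repeat the rows through the index and then overwrite the cost column. This is only a sketch: the helper name `make_ladder` is hypothetical, not part of pandas, and the sample frame is rebuilt from the question.

```python
import numpy as np
import pandas as pd

def make_ladder(df, steps, col="cost"):
    """Repeat each row once per step and add the steps to `col`."""
    steps = np.asarray(steps, dtype=float)
    n = steps.size
    # Keep the original label as a column, then repeat each row n times
    out = df.rename_axis("original_idx").reset_index()
    out = out.loc[out.index.repeat(n)].reset_index(drop=True)
    # Broadcasting: (rows, 1) + (n,) -> (rows, n), flattened row by row
    out[col] = (df[col].to_numpy()[:, None] + steps).ravel()
    return out

df = pd.DataFrame({"color": ["blue", "red", "pink"],
                   "cost": [12.0, 8.1, 24.5],
                   "temp": [80.4, 81.2, 83.5]})
result = make_ladder(df, [-0.5, 0.0, 0.5])
```

Here `result` has nine rows, with `original_idx` as the first column and `cost` and `temp` still stored as floats.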
Here's an approach based on NumPy array initialization -
import numpy as np
import pandas as pd

increments = 0.5*np.arange(-1,2) # Edit the increments here
names = np.append(df.columns, 'original_idx')
M,N = df.shape
vals = df.values
cost_col_idx = (names == 'cost').argmax()  # position of the 'cost' column
n = len(increments)
shp = (M,n,N+1)  # (rows, steps, columns + original_idx)
b = np.empty(shp,dtype=object)
b[...,:-1] = vals[:,None]  # copy every row into each of its n slots
b[...,-1] = np.arange(M)[:,None]  # original row position
b[...,cost_col_idx] = vals[:,cost_col_idx].astype(float)[:,None] + increments
b = b.reshape(-1,N+1)  # flatten to (M*n, N+1)
df_out = pd.DataFrame(b, columns=names)
To have the increments go from -50 to +50 in steps of 0.5, use:
increments = 0.5*np.arange(-100,101)
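Scaling an integer range like this sidesteps the usual caveat that `np.arange` with a fractional step can produce an unexpected number of elements. As a quick check, and assuming `np.linspace` is an acceptable alternative for your case, both constructions below should give the same 201 offsets. The name `increments_alt` is illustrative only.

```python
import numpy as np

# Integer range scaled by the step size: -100..100 inclusive -> 201 values
increments = 0.5 * np.arange(-100, 101)

# Alternative that states the endpoints and the number of points explicitly
increments_alt = np.linspace(-50.0, 50.0, 201)
```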
Sample run -
In [200]: df
Out[200]:
color cost temp newcol
0 blue 12.0 80.4 mango
1 red 8.1 81.2 banana
2 pink 24.5 83.5 apple
In [201]: df_out
Out[201]:
color cost temp newcol original_idx
0 blue 11.5 80.4 mango 0
1 blue 12 80.4 mango 0
2 blue 12.5 80.4 mango 0
3 red 7.6 81.2 banana 1
4 red 8.1 81.2 banana 1
5 red 8.6 81.2 banana 1
6 pink 24 83.5 apple 2
7 pink 24.5 83.5 apple 2
8 pink 25 83.5 apple 2