Fastest way to set values for group slices in Pandas

Question

Is there a faster, more efficient way to do the last two lines? Perhaps with where?

import pandas as pd
import numpy as np

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

tuples = list(zip(*arrays))

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

df = pd.DataFrame(np.random.randn(8,2), index=index, columns=['A', 'B'])

for second, group in df.groupby(level='second'):
    df.loc[group.index, 'A'] = np.random.randn(1)[0]

Answer 1

EDIT: multiplied the arrays by 100000 to simulate big data and compare timings.

Your data:

import pandas as pd
import numpy as np

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']*100000,
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']*100000]

tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8*100000,2), index=index, columns=['A', 'B'])

From here on, I assume nothing about the data in the DataFrame except that its index has a level named 'second' and that you want column 'A' filled with one np.random.randn value per 'second' label.

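# Position of the 'second' level within the MultiIndex
second_index = df.index.names.index('second')
# Integer code of the 'second' label for each row
# (MultiIndex.labels was renamed to MultiIndex.codes in pandas 0.24)
second_labels = df.index.codes[second_index]
no_second_labels = len(df.index.levels[second_index])
# One random value per distinct label
rands = np.random.randn(no_second_labels)

# Index the random values by code to spread them over all rows
df.A = rands[second_labels]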
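
The indexing approach above can be tried on a small frame. What follows is only a minimal sketch, assuming pandas 0.24 or later, where the per-row integer codes are exposed as MultiIndex.codes. The helper name assign_per_label is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd


def assign_per_label(df, level, column, rng):
    """Set `column` to one random value per distinct label of `level`.

    Hypothetical helper for illustration; not part of pandas.
    """
    pos = df.index.names.index(level)        # position of the level
    codes = np.asarray(df.index.codes[pos])  # integer code for each row
    n_labels = len(df.index.levels[pos])     # distinct labels in the level
    values = rng.standard_normal(n_labels)   # one draw per label
    df[column] = values[codes]               # spread values over rows
    return df


arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
rng = np.random.default_rng(0)
small = pd.DataFrame(rng.standard_normal((8, 2)), index=index,
                     columns=['A', 'B'])
small = assign_per_label(small, 'second', 'A', rng)

# Number of distinct 'A' values within each 'second' group
per_group_unique = small.groupby(level='second')['A'].nunique()
print(per_group_unique)
```

One caveat with this sketch: a row whose label is missing gets the code -1, which would silently pick the last value in the array, so it assumes the level has no missing labels.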

# My solution
%%timeit
second_index = df.index.names.index('second')
# MultiIndex.codes was called MultiIndex.labels before pandas 0.24
second_labels = df.index.codes[second_index]
no_second_labels = len(df.index.levels[second_index])
rands = np.random.randn(no_second_labels)
df.A = rands[second_labels]
# 100 loops, best of 3: 11.1 ms per loop

# Alexander's solution
%%timeit
randoms = {n: np.random.randn(1)[0] for n, _ in enumerate(df.index.levels[1])}
# MultiIndex.codes was called MultiIndex.labels before pandas 0.24
df['A'] = [randoms[n] for n in df.index.codes[1].tolist()]
# 1 loops, best of 3: 188 ms per loop
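
The %%timeit cells above only run inside IPython or Jupyter. As a rough sketch, the same comparison could be made in a plain script with the standard-library timeit module. The function names vectorized and dict_lookup are hypothetical, the repeat factor is reduced to 1000 to keep the run short, and pandas 0.24 or later is assumed for MultiIndex.codes. Absolute timings will differ from the figures quoted above.

```python
import timeit

import numpy as np
import pandas as pd

n = 1000  # repeat factor; the answer above used 100000
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'] * n,
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'] * n]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8 * n, 2), index=index, columns=['A', 'B'])


def vectorized():
    # Index an array of random values by the per-row integer codes
    pos = df.index.names.index('second')
    codes = np.asarray(df.index.codes[pos])
    rands = np.random.randn(len(df.index.levels[pos]))
    df['A'] = rands[codes]


def dict_lookup():
    # Look each row's code up in a dict inside a list comprehension
    randoms = {i: np.random.randn(1)[0]
               for i, _ in enumerate(df.index.levels[1])}
    df['A'] = [randoms[c] for c in df.index.codes[1].tolist()]


loops = 10
t_vec = min(timeit.repeat(vectorized, number=loops, repeat=3)) / loops
t_dict = min(timeit.repeat(dict_lookup, number=loops, repeat=3)) / loops
print(f"vectorized:  {t_vec * 1e3:.3f} ms per loop")
print(f"dict lookup: {t_dict * 1e3:.3f} ms per loop")
```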

Answer 2

You can first create a set of random numbers, one for each item in the second level of the index (df.index.levels[1]). Then you can use a list comprehension to cycle through the label of each row for that level and map it to its random number.

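np.random.seed(0)
# One random value per position in the second index level
randoms = {n: np.random.randn(1)[0] for n, _ in enumerate(df.index.levels[1])}
# MultiIndex.codes was called MultiIndex.labels before pandas 0.24
df['A'] = [randoms[n] for n in df.index.codes[1].tolist()]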

>>> df
                     A         B
first second                    
bar   one     1.764052  0.144044
      two     0.400157  0.761038
baz   one     1.764052  0.443863
      two     0.400157  1.494079
foo   one     1.764052  0.313068
      two     0.400157 -2.552990
qux   one     1.764052  0.864436
      two     0.400157  2.269755
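
To make the mapping concrete, here is a small sketch that prints what the second level and its per-row codes contain for this index, and then applies the same dictionary lookup. It assumes pandas 0.24 or later for MultiIndex.codes. The names level_values and row_codes are only illustrative, and column 'B' is filled with zeros here because its values do not affect the mapping.

```python
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
df = pd.DataFrame(np.zeros((8, 2)), index=index, columns=['A', 'B'])

level_values = list(df.index.levels[1])  # distinct labels of 'second'
row_codes = df.index.codes[1].tolist()   # one integer code per row

np.random.seed(0)
# One random value per code, drawn in level order
randoms = {i: np.random.randn(1)[0] for i, _ in enumerate(level_values)}
df['A'] = [randoms[c] for c in row_codes]

print(level_values)
print(row_codes)
print(df['A'])
```

Each code is the position of the row's label within the level, so rows labelled 'one' share one random value and rows labelled 'two' share another.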

%%timeit
for second, group in df.groupby(level='second'):
    df.loc[group.index, 'A'] = np.random.randn(1)[0]
1000 loops, best of 3: 1.99 ms per loop

%%timeit
randoms = {n: np.random.randn(1)[0] for n, _ in enumerate(df.index.levels[1])}
df['A'] = [randoms[n] for n in df.index.codes[1].tolist()]
10000 loops, best of 3: 120 µs per loop

Answer 1 posted 2015-12-17 01:29:11
Answer 2 posted 2015-12-17 06:06:22
