
Fastest way to set value for group slice in Pandas

Is there a faster, more efficient way to do the last two lines? Perhaps with where?

import pandas as pd
import numpy as np

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

tuples = list(zip(*arrays))

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

df = pd.DataFrame(np.random.randn(8,2), index=index, columns=['A', 'B'])

for second, group in df.groupby(level='second'):
    df.loc[group.index, 'A'] = np.random.randn(1)[0]

EDIT: I multiplied the arrays by 100000 to simulate big data and compared timings.

Your data:

import pandas as pd
import numpy as np

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']*100000,
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']*100000]

tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8*100000,2), index=index, columns=['A', 'B'])

From here, suppose I know nothing about the data in the DataFrame except that its index has a 'second' level, and that you want to fill column 'A' with one random.randn value per label of that 'second' level.

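# Position of the 'second' level within the MultiIndex
second_index = df.index.names.index('second')
# Integer code of each row's label (MultiIndex.labels was renamed to .codes in pandas 0.24)
second_labels = df.index.codes[second_index]
no_second_labels = len(df.index.levels[second_index])
# One random value per distinct label
rands = np.random.randn(no_second_labels)

# Fancy indexing gives every row the value drawn for its label
df.A = rands[second_labels]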
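
As a variation on the snippet above (not part of the original answer), the same per-label broadcast can be sketched without touching the MultiIndex internals at all, by using pd.factorize on the level values. This assumes a small frame shaped like the one in the question; note that factorize numbers the labels in order of first appearance rather than in sorted order, which makes no difference for random values.

```python
import numpy as np
import pandas as pd

# Small frame shaped like the one in the question
index = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
    names=['first', 'second'],
)
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

# factorize returns an integer code per row plus the distinct labels
codes, uniques = pd.factorize(df.index.get_level_values('second'))

# One random draw per distinct label, spread over the rows by fancy indexing
rands = np.random.randn(len(uniques))
df['A'] = rands[codes]

print(df)
```

This is likely somewhat slower than reading the stored codes directly, because factorize has to recompute them, but it also works when 'second' is the only index level.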

#My solution
%%timeit
second_index = df.index.names.index('second')
second_labels = df.index.codes[second_index]  # .labels was renamed to .codes in pandas 0.24
no_second_labels = len(df.index.levels[second_index])
rands = np.random.randn(no_second_labels)
df.A = rands[second_labels]
#100 loops, best of 3: 11.1 ms per loop

#Alexander's solution
%%timeit
randoms = {n: np.random.randn(1)[0] for n, _ in enumerate(df.index.levels[1])}
df['A'] = [randoms[n] for n in df.index.codes[1].tolist()]  # .labels was renamed to .codes in pandas 0.24
#1 loops, best of 3: 188 ms per loop

You can first create one random number for each item in the second level of the index (df.index.levels[1]). Then you can use a list comprehension to walk through each row's label for that level and map it to its random number.

np.random.seed(0)
randoms = {n: np.random.randn(1)[0] for n, _ in enumerate(df.index.levels[1])}
df['A'] = [randoms[n] for n in df.index.codes[1].tolist()]  # .labels was renamed to .codes in pandas 0.24

>>> df
                     A         B
first second                    
bar   one     1.764052  0.144044
      two     0.400157  0.761038
baz   one     1.764052  0.443863
      two     0.400157  1.494079
foo   one     1.764052  0.313068
      two     0.400157 -2.552990
qux   one     1.764052  0.864436
      two     0.400157  2.269755
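
A closely related sketch, offered as an assumption rather than as the answer's own code: the dictionary can be keyed by the label itself ('one', 'two') instead of by its integer position, and Index.map can do the lookup in place of the list comprehension. The frame below is rebuilt only so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Small frame shaped like the one in the question
index = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
    names=['first', 'second'],
)
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

np.random.seed(0)
# One random value per label of the second level, keyed by the label string
randoms = {label: np.random.randn() for label in df.index.levels[1]}

# Index.map looks up each row's 'second' label in the dictionary
second_values = df.index.get_level_values('second')
df['A'] = second_values.map(randoms).to_numpy()

print(df)
```

Keying by label keeps the mapping readable when you later want to inspect which value went to which group.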

%%timeit
for second, group in df.groupby(level='second'):
    df.loc[group.index, 'A'] = np.random.randn(1)[0]
1000 loops, best of 3: 1.99 ms per loop

%%timeit
randoms = {n: np.random.randn(1)[0] for n, _ in enumerate(df.index.levels[1])}
df['A'] = [randoms[n] for n in df.index.codes[1].tolist()]  # .labels was renamed to .codes in pandas 0.24
10000 loops, best of 3: 120 µs per loop
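
The %%timeit cells above only run inside IPython or Jupyter. For a plain Python script, a rough equivalent can be sketched with the standard-library timeit module. The helper names (make_frame, assign_by_codes, assign_by_dict) and the repeat factor of 1000 are illustrative choices, not from the thread, and the code assumes pandas 0.24 or later for MultiIndex.codes. No timings are quoted here because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(repeat):
    """Build the question's frame, tiled `repeat` times (hypothetical helper)."""
    arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'] * repeat,
              ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'] * repeat]
    index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
    return pd.DataFrame(np.random.randn(8 * repeat, 2), index=index, columns=['A', 'B'])


def assign_by_codes(df):
    """Fancy-index an array of random values with the level's integer codes."""
    pos = df.index.names.index('second')
    rands = np.random.randn(len(df.index.levels[pos]))
    df['A'] = rands[df.index.codes[pos]]


def assign_by_dict(df):
    """Look up each row's code in a dict through a list comprehension."""
    randoms = {n: np.random.randn() for n in range(len(df.index.levels[1]))}
    df['A'] = [randoms[n] for n in df.index.codes[1].tolist()]


df = make_frame(1000)
timings = {}
for func in (assign_by_codes, assign_by_dict):
    # Best of 3 runs of 10 calls each, reported per call
    best = min(timeit.repeat(lambda: func(df), number=10, repeat=3)) / 10
    timings[func.__name__] = best
    print(f"{func.__name__}: {best * 1e3:.3f} ms per call")
```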
