Fastest way to set value for group slice in Pandas

Is there a faster, more efficient way to do what the last two lines do? Perhaps using where?

import pandas as pd
import numpy as np

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

tuples = list(zip(*arrays))

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

df = pd.DataFrame(np.random.randn(8,2), index=index, columns=['A', 'B'])

for second, group in df.groupby(level='second'):
    df.loc[group.index, 'A'] = np.random.randn(1)[0]

Edit: multiplied the arrays by 100000 to simulate large data, and added a timing comparison.

Your data:

import pandas as pd
import numpy as np

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']*100000,
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']*100000]

tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8*100000,2), index=index, columns=['A', 'B'])

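From there, I assume nothing is known about the data in the DataFrame except that it has a 'second' index level, and that you want to generate one np.random.randn value per value of 'second' and write it to column 'A'. The level's integer codes say which group each row belongs to, so one draw per group can be spread over all rows with a single array lookup: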

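# Note: MultiIndex.labels was renamed to MultiIndex.codes in pandas 0.24
# and removed in pandas 1.0, so .codes is used here.
second_index = df.index.names.index('second')          # position of the 'second' level
second_labels = df.index.codes[second_index]           # integer code of each row's group
no_second_labels = len(df.index.levels[second_index])  # number of distinct groups
rands = np.random.randn(no_second_labels)              # one random number per group

df.A = rands[second_labels]                            # look up each row's group value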
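
As a self-contained check of the idea above, the following sketch applies the same code-based lookup to the small frame from the question and confirms that every 'second' group ends up with a single value in 'A'. The names `level_pos`, `codes` and `distinct_per_group` are only illustrative, and the frame is built with `from_product` for brevity (an assumption; it yields the same index as the tuples in the question).

```python
import numpy as np
import pandas as pd

# Small frame with the same two-level index as in the question.
index = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
    names=['first', 'second'],
)
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

# Position of the 'second' level, its integer codes (one per row),
# and one random draw per distinct value of that level.
level_pos = df.index.names.index('second')
codes = df.index.codes[level_pos]
rands = np.random.randn(len(df.index.levels[level_pos]))

# Indexing the draws with the codes gives every row its group's value.
df['A'] = rands[codes]

# Each 'second' group should now hold exactly one distinct value in 'A'.
distinct_per_group = df.groupby(level='second')['A'].nunique()
print(distinct_per_group)
```

The lookup `rands[codes]` is plain NumPy integer-array indexing, so no Python-level loop over rows or groups is involved.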

#My solution
%%timeit
second_index = df.index.names.index('second')
second_labels = df.index.codes[second_index]
no_second_labels = len(df.index.levels[second_index])
rands = np.random.randn(no_second_labels)
df.A = rands[second_labels]
#100 loops, best of 3: 11.1 ms per loop

#Alexander's solution
%%timeit
randoms = {n: np.random.randn(1)[0] for n, _ in enumerate(df.index.levels[1])}
df['A'] = [randoms[n] for n in df.index.codes[1].tolist()]
#1 loops, best of 3: 188 ms per loop
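
The `%%timeit` cells above only run inside IPython/Jupyter. A rough sketch of the same comparison as a plain script, using the standard-library `timeit` module, could look like the following. The helper names `make_frame`, `assign_vectorized` and `assign_dict` are hypothetical, and the frame is deliberately smaller than the one timed above, so the numbers it prints will not match those figures.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(repeat):
    """Build the question's frame, tiled `repeat` times (hypothetical helper)."""
    arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'] * repeat,
              ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'] * repeat]
    index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
    return pd.DataFrame(np.random.randn(8 * repeat, 2),
                        index=index, columns=['A', 'B'])


def assign_vectorized(df):
    """One draw per group, spread over the rows via the level codes."""
    pos = df.index.names.index('second')
    rands = np.random.randn(len(df.index.levels[pos]))
    df['A'] = rands[df.index.codes[pos]]


def assign_dict(df):
    """Dict of draws keyed by code, mapped row by row in a list comprehension."""
    randoms = {n: np.random.randn(1)[0] for n, _ in enumerate(df.index.levels[1])}
    df['A'] = [randoms[n] for n in df.index.codes[1].tolist()]


frame = make_frame(1000)  # smaller than the 100000 used above, to keep the run short
for func in (assign_vectorized, assign_dict):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(lambda: func(frame), number=10, repeat=3)) / 10
    print(f'{func.__name__}: {best * 1e3:.3f} ms per call')
```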

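You can first create a set of random numbers, one for each item in the second level of the index (df.index.levels[1]). You can then use a list comprehension to loop over each row's label for that level and map it to its random number.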

np.random.seed(0)
randoms = {n: np.random.randn(1)[0] for n, _ in enumerate(df.index.levels[1])}
df['A'] = [randoms[n] for n in df.index.codes[1].tolist()]

>>> df
                     A         B
first second                    
bar   one     1.764052  0.144044
      two     0.400157  0.761038
baz   one     1.764052  0.443863
      two     0.400157  1.494079
foo   one     1.764052  0.313068
      two     0.400157 -2.552990
qux   one     1.764052  0.864436
      two     0.400157  2.269755
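
A variation on the mapping above, offered only as a sketch: the dictionary can be keyed by the level's labels ('one', 'two') instead of their integer positions, and the lookup done with `Index.map` on `get_level_values('second')`. This is not the code from the answer, and the frame is rebuilt here with `from_product` (an assumption) so that the block runs on its own.

```python
import numpy as np
import pandas as pd

# Same two-level index as in the question.
index = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
    names=['first', 'second'],
)
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

np.random.seed(0)
# One random number per distinct label of the 'second' level,
# keyed by the label itself rather than by its integer code.
randoms = {label: np.random.randn() for label in df.index.levels[1]}

# Look up each row's 'second' label in the dict.
second_values = df.index.get_level_values('second')
df['A'] = second_values.map(randoms).to_numpy()

print(df)
```

Keying by label makes the dictionary easier to read when inspecting it, at the cost of mapping Python objects instead of indexing with integers.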

%%timeit
for second, group in df.groupby(level='second'):
    df.loc[group.index, 'A'] = np.random.randn(1)[0]
1000 loops, best of 3: 1.99 ms per loop

%%timeit
randoms = {n: np.random.randn(1)[0] for n, _ in enumerate(df.index.levels[1])}
df['A'] = [randoms[n] for n in df.index.codes[1].tolist()]
10000 loops, best of 3: 120 µs per loop
