
Quickly convert Pandas Series of labels into Series of indirect values from corresponding columns

I have the following example dataframe:

import numpy as np
import pandas as pd

N = np.arange(1, 10)
df = pd.DataFrame({
    'ref': [ 'a',  'b',  'c',  'd',  'c',  'b',  'a',  'b',  'c'],
    'a':   [   1,    2,    3,    4,    5,    6,    7,    8,    9],
    'b':   [  10,   20,   30,   40,   50,   60,   70,   80,   90],
    'c':   [ 100,  200,  300,  400,  500,  600,  700,  800,  900],
    'd':   [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000],
})

I want to "dereference" the ref column in some way, to get this:

    'ref': [ 'a',  'b',  'c',  'd',  'c',  'b',  'a',  'b',  'c'],
    'ind': [   1,   20,  300, 4000,  500,   60,    7,   80,  900],

So each value in ind should correspond to the value, at the same position, in the column named by ref.

A naïve approach would be to use something like df[df['ref']], then multiply by an identity matrix and sum it column-wise. But because I have a rather large (~8 GB) dataframe, I guess doing this would nearly square its size. And it just doesn't feel right.
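
For concreteness, that naïve path would look roughly like the sketch below (illustration only): df[df['ref']] picks one column per row label and so materializes an n × n intermediate, which is exactly the size blow-up described above; taking its diagonal is equivalent to the identity-multiply-and-sum.

# naïve sketch: df[df['ref']] selects one column per row label -> n x n frame (quadratic memory)
wide = df[df['ref']]
# the wanted values sit on the diagonal of that square frame
ind = pd.Series(np.diag(wide.to_numpy()), index=df.index)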

Also, because of its size, just iterating over it is painfully slow. And I can't iterate with Cython, because converting this dataframe into a numpy array loses the label information that I need to find the right column.

Any suggestions?

You can do it using DataFrame.mask or numpy.where, as shown below; it looks like numpy.where performs slightly better on this dataset.

N = np.arange(1, 10)
df_b = pd.DataFrame({
    'ref': [ 'a',  'b',  'c',  'd',  'c',  'b',  'a',  'b',  'c'],
    'a':   [   1,    2,    3,    4,    5,    6,    7,    8,    9],
    'b':   [  10,   20,   30,   40,   50,   60,   70,   80,   90],
    'c':   [ 100,  200,  300,  400,  500,  600,  700,  800,  900],
    'd':   [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000],
})

df_b

Using Pandas Where

%%timeit
df = df_b.copy()
cols = df.columns[1:]
df["ind"] = df["ref"]

for col in cols:
    df.ind.mask(df.ind==col, df[col], inplace=True)
df
## 6.73 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using Numpy's Where

%%timeit
df = df_b.copy()
arr = df.ref.values

cols = df.columns[1:]
for col in cols:
    arr2 = df[col].values
    arr = np.where(arr==col, arr2, arr)

df["ind"] = arr
df

## 1.21 ms ± 73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Result

  ref  a   b    c     d   ind
0   a  1  10  100  1000     1
1   b  2  20  200  2000    20
2   c  3  30  300  3000   300
3   d  4  40  400  4000  4000
4   c  5  50  500  5000   500
5   b  6  60  600  6000    60
6   a  7  70  700  7000     7
7   b  8  80  800  8000    80
8   c  9  90  900  9000   900
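
A closely related variant (a sketch of the same idea, not timed here) replaces the explicit loop with numpy.select, which evaluates all of the conditions in a single call:

df = df_b.copy()
cols = df.columns[1:]

# one condition per candidate column; np.select takes, for each row,
# the value from the first column whose label matches 'ref'
conditions = [df['ref'].to_numpy() == col for col in cols]
choices = [df[col].to_numpy() for col in cols]
df['ind'] = np.select(conditions, choices)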

Use pandas.lookup()

df['ind'] = df.lookup(df.index, df['ref'])

  ref  a   b    c     d   ind
0   a  1  10  100  1000     1
1   b  2  20  200  2000    20
2   c  3  30  300  3000   300
3   d  4  40  400  4000  4000
4   c  5  50  500  5000   500
5   b  6  60  600  6000    60
6   a  7  70  700  7000     7
7   b  8  80  800  8000    80
8   c  9  90  900  9000   900
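
Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so on newer versions you need an equivalent. A factorize-based sketch of the same lookup (assuming every ref value is an actual column label) is:

# equivalent of df.lookup(df.index, df['ref']) on pandas >= 2.0
idx, cols = pd.factorize(df['ref'])
df['ind'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]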

You could use numpy indexing:

lookup = dict(zip(df.columns, range(len(df.columns))))
result = pd.DataFrame({
    'ref': df.ref,
    'ind': df.values[np.arange(len(df)), df.ref.map(lookup)],
})

print(result)

Output

  ref   ind
0   a     1
1   b    20
2   c   300
3   d  4000
4   c   500
5   b    60
6   a     7
7   b    80
8   c   900
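
The label-to-position dict can also be built by pandas itself via Index.get_indexer, a small variant of the same numpy-indexing idea (a sketch, same assumption that every ref value is a column label):

# positions of each row's ref label within df.columns ('ref' itself is column 0)
pos = df.columns.get_indexer(df['ref'])
result = pd.DataFrame({'ref': df.ref,
                       'ind': df.values[np.arange(len(df)), pos]})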
