
Quickly convert Pandas Series of labels into Series of indirect values from corresponding columns

I have the following example dataframe:

import numpy as np
import pandas as pd

N = np.arange(1, 10)
df = pd.DataFrame({
    'ref': [ 'a',  'b',  'c',  'd',  'c',  'b',  'a',  'b',  'c'],
    'a':   [   1,    2,    3,    4,    5,    6,    7,    8,    9],
    'b':   [  10,   20,   30,   40,   50,   60,   70,   80,   90],
    'c':   [ 100,  200,  300,  400,  500,  600,  700,  800,  900],
    'd':   [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000],
})

I want to "dereference" the ref column in some way, to get this:

    'ref': [ 'a',  'b',  'c',  'd',  'c',  'b',  'a',  'b',  'c'],
    'ind': [   1,   20,  300, 4000,  500,   60,    7,   80,  900],

So each value in ind should correspond to the value, at the same position, in the column named by ref.

A naïve approach would be to use something like df[df['ref']], then multiply by an identity matrix and sum it column-wise. But because I have a rather large (~8 GB) dataframe, I guess doing this would nearly square its size. And it just doesn't feel right.
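
For concreteness, that naïve path would look roughly like the sketch below (illustration only): df[df['ref']] picks one column per row label and so materializes an n × n intermediate, which is exactly the size blow-up described above; taking its diagonal is equivalent to the identity-multiply-and-sum.

# naïve sketch: df[df['ref']] selects one column per row label -> n x n frame (quadratic memory)
wide = df[df['ref']]
# the wanted values sit on the diagonal of that square frame
ind = pd.Series(np.diag(wide.to_numpy()), index=df.index)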

Also, because of its size, just iterating over it is painfully slow. And I can't iterate with Cython, because converting this dataframe into a numpy array loses the label information that I need to find the right column.

Any suggestions?

You can do it using DataFrame.mask or numpy.where, as shown below; it looks like numpy.where performs slightly better on this dataset.

N = np.arange(1, 10)
df_b = pd.DataFrame({
    'ref': [ 'a',  'b',  'c',  'd',  'c',  'b',  'a',  'b',  'c'],
    'a':   [   1,    2,    3,    4,    5,    6,    7,    8,    9],
    'b':   [  10,   20,   30,   40,   50,   60,   70,   80,   90],
    'c':   [ 100,  200,  300,  400,  500,  600,  700,  800,  900],
    'd':   [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000],
})

df_b

Using Pandas Where

%%timeit
df = df_b.copy()
cols = df.columns[1:]
df["ind"] = df["ref"]

for col in cols:
    df.ind.mask(df.ind==col, df[col], inplace=True)
df
## 6.73 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using Numpy's Where

%%timeit
df = df_b.copy()
arr = df.ref.values

cols = df.columns[1:]
for col in cols:
    arr2 = df[col].values
    arr = np.where(arr==col, arr2, arr)

df["ind"] = arr
df

## 1.21 ms ± 73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Result

  ref  a   b    c     d   ind
0   a  1  10  100  1000     1
1   b  2  20  200  2000    20
2   c  3  30  300  3000   300
3   d  4  40  400  4000  4000
4   c  5  50  500  5000   500
5   b  6  60  600  6000    60
6   a  7  70  700  7000     7
7   b  8  80  800  8000    80
8   c  9  90  900  9000   900
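
A closely related variant (a sketch of the same idea, not timed here) replaces the explicit loop with numpy.select, which evaluates all of the conditions in a single call:

df = df_b.copy()
cols = df.columns[1:]

# one condition per candidate column; np.select takes, for each row,
# the value from the first column whose label matches 'ref'
conditions = [df['ref'].to_numpy() == col for col in cols]
choices = [df[col].to_numpy() for col in cols]
df['ind'] = np.select(conditions, choices)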

Use pandas.lookup()

df['ind'] = df.lookup(df.index, df['ref'])

  ref  a   b    c     d   ind
0   a  1  10  100  1000     1
1   b  2  20  200  2000    20
2   c  3  30  300  3000   300
3   d  4  40  400  4000  4000
4   c  5  50  500  5000   500
5   b  6  60  600  6000    60
6   a  7  70  700  7000     7
7   b  8  80  800  8000    80
8   c  9  90  900  9000   900
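
Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so on newer versions you need an equivalent. A factorize-based sketch of the same lookup (assuming every ref value is an actual column label) is:

# equivalent of df.lookup(df.index, df['ref']) on pandas >= 2.0
idx, cols = pd.factorize(df['ref'])
df['ind'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]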

You could use numpy indexing:

lookup = dict(zip(df.columns, range(len(df.columns))))
result = pd.DataFrame({
    'ref': df.ref,
    'ind': df.values[np.arange(len(df)), df.ref.map(lookup)],
})

print(result)

Output

  ref   ind
0   a     1
1   b    20
2   c   300
3   d  4000
4   c   500
5   b    60
6   a     7
7   b    80
8   c   900
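
The label-to-position dict can also be built by pandas itself via Index.get_indexer, a small variant of the same numpy-indexing idea (a sketch, same assumption that every ref value is a column label):

# positions of each row's ref label within df.columns ('ref' itself is column 0)
pos = df.columns.get_indexer(df['ref'])
result = pd.DataFrame({'ref': df.ref,
                       'ind': df.values[np.arange(len(df)), pos]})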
