Quickly convert Pandas Series of labels into Series of indirect values from corresponding columns
I have the following example dataframe:
import numpy as np
import pandas as pd

N = np.arange(1, 10)
df = pd.DataFrame({
'ref': [ 'a', 'b', 'c', 'd', 'c', 'b', 'a', 'b', 'c'],
'a': [ 1, 2, 3, 4, 5, 6, 7, 8, 9],
'b': [ 10, 20, 30, 40, 50, 60, 70, 80, 90],
'c': [ 100, 200, 300, 400, 500, 600, 700, 800, 900],
'd': [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000],
})
I want to "dereference" the ref column in some way, to get this:
'ref': [ 'a', 'b', 'c', 'd', 'c', 'b', 'a', 'b', 'c'],
'ind': [ 1, 20, 300, 4000, 500, 60, 7, 80, 900],
So each value in ind should correspond to the value in the column named by ref at the same position.
A naive approach would be to use something like df[df['ref']], then multiply by an identity matrix, then sum it column-wise. But because I have quite a large (~8 GB) dataframe, doing this would, I guess, nearly square its size. And it just doesn't feel right.
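For illustration, here is a minimal sketch of that naive approach on the example data. It reads the diagonal of the intermediate frame rather than multiplying by an identity matrix and summing, which is an equivalent shortcut assumed here for brevity. The names wide and ind_naive are illustrative only. The point is that the intermediate has one column per row, so it grows as n x n.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ref': ['a', 'b', 'c', 'd', 'c', 'b', 'a', 'b', 'c'],
    'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'b': [10, 20, 30, 40, 50, 60, 70, 80, 90],
    'c': [100, 200, 300, 400, 500, 600, 700, 800, 900],
    'd': [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000],
})

# Selecting columns by the ref labels gives one column per row: an n x n frame.
wide = df[df['ref'].tolist()]
print(wide.shape)  # (9, 9)

# Row i, column i holds the value of column ref[i] at row i,
# so the wanted values sit on the diagonal of the square intermediate.
ind_naive = np.diag(wide.to_numpy())
print(ind_naive.tolist())
```

With n rows the intermediate holds n * n values, which is why this does not scale to a multi-gigabyte frame.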
Also, due to the size, just iterating over it is painfully slow. And I can't iterate with Cython, because converting this dataframe into a numpy array loses the label information, which I need to properly find the column.
Any suggestions?
You can do it using DataFrame.mask or numpy.where, as shown below. It looks like numpy.where performs slightly better on this dataset.
N = np.arange(1, 10)
df_b = pd.DataFrame({
'ref': [ 'a', 'b', 'c', 'd', 'c', 'b', 'a', 'b', 'c'],
'a': [ 1, 2, 3, 4, 5, 6, 7, 8, 9],
'b': [ 10, 20, 30, 40, 50, 60, 70, 80, 90],
'c': [ 100, 200, 300, 400, 500, 600, 700, 800, 900],
'd': [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000],
})
df_b
Using Pandas mask
%%timeit
df = df_b.copy()
cols = df.columns[1:]
df["ind"] = df["ref"]
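for col in cols:
    # replace each label with the value from the column it names;
    # assign back rather than using inplace=True on df.ind, which may not update df under copy-on-write
    df["ind"] = df["ind"].mask(df["ind"] == col, df[col])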
df
## 6.73 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using Numpy's Where
%%timeit
df = df_b.copy()
arr = df.ref.values
cols = df.columns[1:]
for col in cols:
    arr2 = df[col].values
    arr = np.where(arr==col, arr2, arr)
df["ind"] = arr
df
## 1.21 ms ± 73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Result
  ref  a   b    c     d   ind
0   a  1  10  100  1000     1
1   b  2  20  200  2000    20
2   c  3  30  300  3000   300
3   d  4  40  400  4000  4000
4   c  5  50  500  5000   500
5   b  6  60  600  6000    60
6   a  7  70  700  7000     7
7   b  8  80  800  8000    80
8   c  9  90  900  9000   900
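A related variant, sketched here as an assumption rather than something from the answer above, is numpy.select. It takes one boolean mask and one value array per column and builds the result in a single call, so the output keeps an integer dtype instead of passing through a mixed object array. The names conditions and choices are illustrative, and default=0 is an arbitrary placeholder for labels that match no column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ref': ['a', 'b', 'c', 'd', 'c', 'b', 'a', 'b', 'c'],
    'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'b': [10, 20, 30, 40, 50, 60, 70, 80, 90],
    'c': [100, 200, 300, 400, 500, 600, 700, 800, 900],
    'd': [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000],
})

cols = df.columns[1:]  # value columns: a, b, c, d

# One boolean mask per value column: True where ref names that column.
conditions = [(df['ref'] == col).to_numpy() for col in cols]
# The matching value array for each mask.
choices = [df[col].to_numpy() for col in cols]

# default=0 is an assumed placeholder for ref labels that match no column.
df['ind'] = np.select(conditions, choices, default=0)
print(df[['ref', 'ind']])
```

Like the loop versions, this still makes one pass over the data per value column, so it suits frames with few value columns.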
Use DataFrame.lookup() (available only in pandas versions before 2.0)
df['ind'] = df.lookup(df.index, df['ref'])
  ref  a   b    c     d   ind
0   a  1  10  100  1000     1
1   b  2  20  200  2000    20
2   c  3  30  300  3000   300
3   d  4  40  400  4000  4000
4   c  5  50  500  5000   500
5   b  6  60  600  6000    60
6   a  7  70  700  7000     7
7   b  8  80  800  8000    80
8   c  9  90  900  9000   900
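DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A minimal sketch of an equivalent for newer versions, based on pd.factorize plus numpy integer indexing, might look like the following. The variable names are illustrative, and it assumes every label in ref is an existing column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ref': ['a', 'b', 'c', 'd', 'c', 'b', 'a', 'b', 'c'],
    'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'b': [10, 20, 30, 40, 50, 60, 70, 80, 90],
    'c': [100, 200, 300, 400, 500, 600, 700, 800, 900],
    'd': [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000],
})

# codes[i] is the position of ref[i] within the unique labels.
codes, labels = pd.factorize(df['ref'])

# Reorder the value columns to match the unique labels, then drop to numpy.
values = df.reindex(columns=labels).to_numpy()

# Pick one element per row: row i, column codes[i].
df['ind'] = values[np.arange(len(df)), codes]
print(df)
```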
You could use numpy indexing:
lookup = dict(zip(df.columns, range(len(df.columns))))
result = pd.DataFrame({ 'ref' : df.ref, 'ind': df.values[np.arange(len(df)), df.ref.map(lookup)] })
print(result)
Output
  ref   ind
0   a     1
1   b    20
2   c   300
3   d  4000
4   c   500
5   b    60
6   a     7
7   b    80
8   c   900
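One caveat with df.values on a frame that mixes strings and integers is that it produces an object array. A hedged variation of the same idea restricts the array to the value columns and translates the labels to integer positions with Index.get_indexer, so the data stays a plain integer block. The list value_cols is an assumption about which columns ref may point to.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ref': ['a', 'b', 'c', 'd', 'c', 'b', 'a', 'b', 'c'],
    'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'b': [10, 20, 30, 40, 50, 60, 70, 80, 90],
    'c': [100, 200, 300, 400, 500, 600, 700, 800, 900],
    'd': [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000],
})

value_cols = ['a', 'b', 'c', 'd']  # assumed: the columns ref can point to

# Homogeneous integer block; the label column is left out.
values = df[value_cols].to_numpy()

# Integer column position for each ref label; -1 marks an unknown label.
col_idx = pd.Index(value_cols).get_indexer(df['ref'].to_numpy())
if (col_idx < 0).any():
    raise KeyError("ref contains labels that are not value columns")

# Pick one element per row: row i, column col_idx[i].
ind = values[np.arange(len(df)), col_idx]
result = pd.DataFrame({'ref': df['ref'], 'ind': ind})
print(result)
```

The pair values and col_idx consists of plain numeric arrays, so the label information survives as integer codes and could also be handed to a compiled loop.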