Python / pandas: Fastest way to set and retrieve data, without chained assignment
I am writing some routines that access scalars and vectors from a pandas DataFrame and then set the results after some calculations.
Initially I used the form df[var][index] to do this, but ran into problems with chained assignment ( http://pandas.pydata.org/pandas-docs/dev/indexing.html%23indexing-view-versus-copy )
So I changed it to use df.loc[index,var], which solved the view/copy problem but is very slow. For arrays I convert the data to a pandas Series and use the built-in df.update().
I am now searching for the fastest/best way of doing this without having to worry about chained assignment. The documentation says that, for example, df.at[] is the quickest way to access scalars.
Does anyone have any experience with this, or can you point to some literature that can help?
Thanks
Edit: Code looks like this, which I think is pretty standard.
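def set_var(self, name, periode, value):
    ''' Set value '''
    try:
        if name.upper() not in self.data:
            # create the column first; num is presumably the numpy alias
            self.data[name.upper()] = num.nan
        self.data.loc[periode, name.upper()] = value
    except Exception:
        print('Fail to set ' + name)

def get_var(self, navn, periode):
    ''' Get value '''
    try:
        value = self.data.loc[periode, navn.upper()]
    except Exception:
        # the original post is cut off here; this handler is a guess
        print('Fail to get ' + navn)
        value = None
    return value

def set_series(self, data, index):
    outputserie = pd.Series(data, index)
    # note: DataFrame.update aligns a Series by its name attribute,
    # so the Series needs a name that matches a column
    self.data.update(outputserie)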
The dataframe looks like this:
SC0.data
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 148 entries, 1980Q1 to 2016Q4
Columns: 3111 entries, CAP1 to CHH_DRD
dtypes: float64(3106), int64(2), object(3)
Edit 2:
A df could look like
             var     var1
2012Q4  0.462015  0.01585
2013Q1  0.535161  0.01577
2013Q2  0.735432  0.01401
2013Q3  0.845959  0.01638
2013Q4  0.776809  0.01657
2014Q1  0.000000  0.01517
2014Q2  0.000000  0.01593
and I basically want to perform two operations:
1) perhaps update var1 with the same scalar over all periods
2) solve var in 2014Q1 as var,2013Q4 = var1,2013Q3/var2013Q4*var,2013Q4
This is done as part of a bigger model setup, which is read from a txt file. Since I am doing loads of these calculations, the speed of setting and reading data matters.
The example you gave above can be vectorized.
In [3]: df = DataFrame(dict(A = np.arange(10), B = np.arange(10)),index=pd.period_range('2012',freq='Q',periods=10))
In [4]: df
Out[4]:
A B
2012Q1 0 0
2012Q2 1 1
2012Q3 2 2
2012Q4 3 3
2013Q1 4 4
2013Q2 5 5
2013Q3 6 6
2013Q4 7 7
2014Q1 8 8
2014Q2 9 9
Assign a scalar
In [5]: df['A'] = 5
In [6]: df
Out[6]:
A B
2012Q1 5 0
2012Q2 5 1
2012Q3 5 2
2012Q4 5 3
2013Q1 5 4
2013Q2 5 5
2013Q3 5 6
2013Q4 5 7
2014Q1 5 8
2014Q2 5 9
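The same kind of assignment can be limited to a range of periods with a single .loc call. This is only a sketch: the frame name demo and the chosen quarters are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Illustrative frame with a quarterly PeriodIndex (the name `demo` is an assumption)
demo = pd.DataFrame(
    {"A": np.arange(10), "B": np.arange(10)},
    index=pd.period_range("2012Q1", freq="Q", periods=10),
)

# A single .loc call with (row slice, column) writes into demo itself,
# so no chained assignment is involved. Label slices include both endpoints.
demo.loc["2013Q1":"2013Q4", "A"] = 5
```

Only the four quarters of 2013 are overwritten; the other rows of A keep their values.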
Perform a shifted operation
In [8]: df['C'] = df['B'].shift()/df['B'].shift(2)
In [9]: df
Out[9]:
A B C
2012Q1 5 0 NaN
2012Q2 5 1 NaN
2012Q3 5 2 inf
2012Q4 5 3 2.000000
2013Q1 5 4 1.500000
2013Q2 5 5 1.333333
2013Q3 5 6 1.250000
2013Q4 5 7 1.200000
2014Q1 5 8 1.166667
2014Q2 5 9 1.142857
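shift() moves the values down by one period, so each row sees the value from the previous quarter; the inf in 2012Q3 comes from dividing 1 by 0. A small sketch of that alignment follows. Treating infinities as missing is an assumption made here for illustration, not something the original answer does.

```python
import numpy as np
import pandas as pd

idx = pd.period_range("2012Q1", freq="Q", periods=5)
b = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0], index=idx)

prev1 = b.shift()    # value from one quarter earlier
prev2 = b.shift(2)   # value from two quarters earlier

# Element-wise division over the whole column; 2012Q3 is 1/0, which gives inf
ratio = prev1 / prev2

# Optional clean-up (an assumption): treat infinities as missing values
ratio_clean = ratio.replace([np.inf, -np.inf], np.nan)
```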
Using a vectorized assignment
In [10]: df.loc[df['B']>5,'D'] = 'foo'
In [11]: df
Out[11]:
A B C D
2012Q1 5 0 NaN NaN
2012Q2 5 1 NaN NaN
2012Q3 5 2 inf NaN
2012Q4 5 3 2.000000 NaN
2013Q1 5 4 1.500000 NaN
2013Q2 5 5 1.333333 NaN
2013Q3 5 6 1.250000 foo
2013Q4 5 7 1.200000 foo
2014Q1 5 8 1.166667 foo
2014Q2 5 9 1.142857 foo
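For the cases that really do need one cell at a time, which is what get_var/set_var in the question do, df.at takes the same (row label, column label) pair as df.loc and is documented as the accessor for getting or setting a single value. A minimal sketch, not a benchmark: the helper names get_scalar and set_scalar are hypothetical, and the row label is built as a pd.Period to match the PeriodIndex.

```python
import numpy as np
import pandas as pd

# Small frame shaped like the one in the question (values copied from it)
df = pd.DataFrame(
    {"VAR": [0.776809, 0.0], "VAR1": [0.01657, 0.01517]},
    index=pd.period_range("2013Q4", freq="Q", periods=2),
)

def get_scalar(frame, name, period):
    """Read one cell by (period, column) label using .at."""
    return frame.at[pd.Period(period, freq="Q"), name.upper()]

def set_scalar(frame, name, period, value):
    """Write one cell by label; create the column first if it is missing."""
    col = name.upper()
    if col not in frame.columns:
        frame[col] = np.nan
    frame.at[pd.Period(period, freq="Q"), col] = value

set_scalar(df, "var", "2014Q1", 0.5)
value = get_scalar(df, "var", "2014Q1")
```

Because .at is a single indexing call on the frame, it avoids the view/copy ambiguity of df[var][index] in the same way df.loc does.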