Is there a way to make a dataframe from a dictionary of lists of lists array?

I am currently trying to build a rolling-window timeseries of Principal Component Analysis results for stock returns in order to build a backtest. I want to see whether setting the weights of the respective assets over time performs better (stronger returns) than a buy-and-hold portfolio. The problem is that it is quite difficult to build a timeseries of the component weights that correspond to the PCA. I came up with somewhat of a fix, but cannot seem to build a timeseries for this data. I am also struggling to replace the key values in the dictionary with the datetime series. I have looked around and tried most of the suggestions on Stack Overflow, but to no avail.

The below code is what I have come up with:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

np.random.seed(42)

values_for_df = []
for i in range(1,6):    
    random_numbers = np.random.random(size=60)
    values_for_df.append(random_numbers)

df = pd.DataFrame(values_for_df).T

weights = {}
dates_1 = {}
for i in range(1, len(df)):
    pca = PCA()
    transf = pca.fit_transform(df.iloc[i:i+2])
    weights[i] = pca.components_
    dates_1[i] = df.iloc[i].name

The output is a dictionary of lists of lists. As indicated, I am having a hard time turning this into a df using either pd.DataFrame() or pd.concat().

Is there any way to turn this output into a dataframe where the two PCA component weight rows correspond to a datetime?

The output of this code looks like this:

{1: array([[ 0.50938649,  0.1163777 , -0.56213712, -0.5999693 , -0.2258768 ],
        [-0.19623229,  0.68084356,  0.17894347,  0.05397575, -0.68044896]]),
 2: array([[ 0.76101188, -0.39708989, -0.35225525, -0.01460473, -0.37267074],
        [ 0.26603362, -0.44559758,  0.81939468,  0.00324688,  0.24341472]]),
 3: array([[ 0.43735771,  0.07284643, -0.23807945,  0.46456192, -0.72863711],
        [-0.84990214, -0.03839851,  0.14762177,  0.40008466, -0.30713514]]),
 4: array([[-0.10002177, -0.12908589,  0.09697811, -0.54718954,  0.81517565],
        [ 0.9291778 ,  0.24487735, -0.15042811, -0.23197187,  0.01497085]]),
 5: array([[ 0.43260558, -0.17245194, -0.15363331,  0.64845393, -0.58225171],
        [-0.8998753 , -0.03170306, -0.04644508,  0.3319358 , -0.27727395]]),
 6: array([[-0.66851419,  0.31545065, -0.26741055, -0.54749379,  0.28691779],
        [ 0.3598592 , -0.05698951, -0.01176088,  0.02137245,  0.93094492]]),
 7: array([[ 0.69949617, -0.46121291,  0.26456096,  0.47439289,  0.05428297],
        [ 0.0671515 , -0.02046416,  0.07459749, -0.04681467, -0.99363751]]),
 8: array([[ 0.76526418, -0.23880119, -0.57563869, -0.12170626, -0.10569961],
        [-0.20948119, -0.96814145,  0.11706768, -0.02831197,  0.06567612]]),
 9: array([[ 0.88308511, -0.18178186,  0.23418943,  0.05558346,  0.35941875],
        [ 0.3864688 , -0.20776523, -0.30713553, -0.12458004, -0.83523832]]),
 10: array([[ 0.02145911,  0.17212618, -0.34312327, -0.91962789,  0.08039307],
        [-0.93784872,  0.14547558,  0.22919403, -0.09705987, -0.19319965]]),
 11: array([[-0.28946201, -0.26603042,  0.62500451, -0.66932375,  0.08255082],
        [-0.79432192, -0.0826848 ,  0.20253363,  0.56666393,  0.0093821 ]]),
 12: array([[ 0.4225668 ,  0.63454067, -0.52748616,  0.37344672, -0.0330355 ],
        [ 0.89717194, -0.19965603,  0.28582373, -0.27012438,  0.02361333]]),
 13: array([[-0.09152907,  0.18236668, -0.43896889,  0.65056049, -0.5851856 ],
        [ 0.19225542,  0.02507023, -0.12112356, -0.68443942, -0.69230131]]),
 14: array([[ 0.52763656,  0.65909855, -0.10621454, -0.26420703,  0.45398444],
        [-0.20903038, -0.39874697,  0.03275961,  0.0985442 ,  0.88686132]]),
 15: array([[-0.6376942 , -0.65434659,  0.23591625, -0.20141987,  0.26258372],
        [-0.26207514, -0.31149866, -0.55752568,  0.44580663, -0.56983048]]),
 16: array([[ 0.27907902,  0.33000177, -0.37818218, -0.21758258, -0.78920833],
        [ 0.49977863, -0.51086522,  0.39388011, -0.57407273, -0.06735727]]),
 17: array([[-0.07747888, -0.44363775,  0.72389959,  0.51430407,  0.09296923],
        [-0.44632809, -0.37360701, -0.37433274,  0.2591633 , -0.67373468]]),
 18: array([[-0.24853706, -0.28143494, -0.09349904,  0.91280228,  0.13066607],
        [-0.90863048, -0.25882281,  0.08144532, -0.30767452, -0.07813102]]),
 19: array([[-0.0499767 , -0.46808766,  0.81593976,  0.32495903,  0.08390597],
        [-0.18009682,  0.19879004, -0.0864013 ,  0.65630871, -0.69988667]]),
 20: array([[-0.15978936,  0.40505628,  0.23403331,  0.27166524,  0.82572585],
        [-0.82190218, -0.11639043, -0.04382051, -0.54840546,  0.09089168]]),
 21: array([[-0.59793074,  0.36403396,  0.28523106, -0.56614702,  0.32875356],
        [ 0.08787018, -0.09763207,  0.94862929,  0.27707493, -0.07796638]]),
 22: array([[-0.04762231, -0.48706884,  0.45248363,  0.37215567, -0.64595262],
        [ 0.44614193,  0.47456984,  0.55381454, -0.44821569, -0.26102299]]),
 23: array([[-6.14200977e-01, -8.29742681e-02,  1.70228332e-01,
         -7.64025699e-01,  5.62092342e-02],
        [ 9.85466281e-02, -7.29776513e-01, -4.55585357e-04,
         -4.97073250e-02, -6.74717553e-01]]),

When attempting to create a df, I get this:

    weights_keys    weights_values
0   1   [[0.5093864920875057, 0.11637769781544054, -0....
1   2   [[0.7610118804227364, -0.3970898897595845, -0....
2   3   [[0.43735770537072516, 0.07284642654346118, -0...
3   4   [[-0.100021766544103, -0.12908589345836016, 0....
4   5   [[0.43260557607788175, -0.17245193633756645, -...
5   6   [[-0.6685141891902584, 0.3154506469430627, -0....
6   7   [[0.6994961703309339, -0.4612129082876791, 0.2...
7   8   [[0.7652641817892236, -0.23880119387494167, -0...
8   9   [[0.8830851102283364, -0.18178185688401122, 0....
9   10  [[0.02145910731659373, 0.17212617677552292, -0...
10  11  [[-0.28946201366547714, -0.2660304245115253, 0...
11  12  [[0.42256679812505826, 0.6345406677421921, -0....
12  13  [[-0.09152906655393278, 0.1823666758882022, -0...
13  14  [[0.5276365649456491, 0.6590985509896493, -0.1...
14  15  [[-0.6376941956390323, -0.6543465915749572, 0....
15  16  [[0.27907901752772, 0.33000177354673366, -0.37...
16  17  [[-0.07747887772273652, -0.44363774912889514, ...

An example of what the dataframe should look like is this:

                                         USDJPY         EURUSD         GBPUSD         AUDUSD         GBPAUD
20210924 21:00:00  Component weights 1   1.618764e-09  -5.137869e-10  -7.915763e-10  -6.841845e-10   4.352906e-10
                   Component weights 2  -5.137869e-10   1.900899e-09   9.721030e-10   1.872090e-09  -4.564939e-10
                   Component weights 3  -7.915763e-10   9.721030e-10   3.363203e-09   3.988530e-09   9.450517e-10
                   Component weights 4  -6.841845e-10   1.872090e-09   3.988530e-09   1.277432e-08  -2.272119e-09
                   Component weights 5   4.352906e-10  -4.564939e-10   9.450517e-10  -2.272119e-09   7.960307e-09
...                                      ...            ...            ...            ...            ...
20210924 21:59:00  Component weights 1   1.618764e-09  -5.137869e-10  -7.915763e-10  -6.841845e-10   4.352906e-10
                   Component weights 2  -5.137869e-10   1.900899e-09   9.721030e-10   1.872090e-09  -4.564939e-10
                   Component weights 3  -7.915763e-10   9.721030e-10   3.363203e-09   3.988530e-09   9.450517e-10
                   Component weights 4  -6.841845e-10   1.872090e-09   3.988530e-09   1.277432e-08  -2.272119e-09
                   Component weights 5   4.352906e-10  -4.564939e-10   9.450517e-10  -2.272119e-09   7.960307e-09

The above df is an example of a PCA created with n_components = 5.

It is not clear what the final output looks like, so I am taking a guess.

weights = {}
dates_1 = {}
for i in range(1, len(df)):
    pca = PCA()
    transf = pca.fit_transform(df.iloc[i:i+2])
    weights[i] = pca.components_.tolist()
    dates_1[i] = df.iloc[i].name

df1 = pd.DataFrame(dates_1.items(), columns=['dates_keys', 'dates_values'])
df2 = pd.DataFrame(weights.items(), columns=['weights_keys', 'weights_values'])

df = df1.merge(df2, left_on='dates_keys', right_on='weights_keys')
df[['pca1', 'pca2']] = pd.DataFrame(df['weights_values'].tolist())
df.drop('weights_values', axis=1, inplace=True)
print(df.head(2))
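
If you also want each row keyed by its date rather than by the integer counter, one possible follow-up (a sketch, assuming dates_values holds the timestamps you want as the index) is:

# Use the datetime from dates_values as the row index and drop the helper key columns.
df = df.set_index('dates_values').drop(columns=['dates_keys', 'weights_keys'])
print(df.head(2))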

Does this solve your problem?

Following @HoneyBeer's response above, a df can be created as below:

df3 = []
for i in range(0, len(weights)):
    new_df = pd.DataFrame(df['weights_values'][i].tolist())
    df3.append(new_df)

# returns.index holds the datetimes of the underlying returns data.
final_df = pd.concat(df3, keys=returns.index).rename(
    index={0: 'Component weights 1',
           1: 'Component weights 2'},
    columns={0: 'USDJPY',
             1: 'EURUSD',
             2: 'GBPUSD',
             3: 'AUDUSD',
             4: 'GBPAUD'})

The result is this:

                                          USDJPY    EURUSD    GBPUSD    AUDUSD    GBPAUD
Date
20210924 21:00:00  Component weights 1 -0.138952 -0.149062  0.547648 -0.264848  0.767079
                   Component weights 2 -0.934455  0.048407  0.125520 -0.140824 -0.298100
20210924 21:01:00  Component weights 1  0.149391  0.255187 -0.094000 -0.653122  0.690766
                   Component weights 2  0.427402 -0.215456  0.255242 -0.621257 -0.565506
20210924 21:02:00  Component weights 1 -0.214539  0.192370 -0.134088  0.146269 -0.936799
...                                          ...       ...       ...       ...       ...
20210924 21:56:00  Component weights 1  0.002072  0.409711 -0.598962 -0.486351 -0.486662
                   Component weights 2 -0.079410  0.416419 -0.490364  0.726674  0.227546
20210924 21:57:00  Component weights 1 -0.287978 -0.138368  0.623330  0.679409  0.218598
                   Component weights 2  0.060904  0.070058  0.550906 -0.206938 -0.803157
20210924 21:58:00  Component weights 1  1.000000  0.000000  0.000000  0.000000  0.000000

Or in picture form: [image: weights in picture form]

Here is what I think you are trying to do. You have a timeseries consisting of 60 sampling intervals. For the purposes of this answer, I will assume the interval is 1 day, so you have 60 days in the timeseries. I also think you have 5 data columns for the timeseries. So your input is something like

date        var1  var2  var3  var4  var5
2018-04-24   1.1   2.2   2.5   3.5   3.3
2018-04-25   1.0   2.3   3.9   8.7   2.7
2018-04-26   0.9   2.7   4.0   6.5   4.6

You are then calculating principal components for a sliding 2-day window. You want to combine all of these PCA results into a single data frame.

For combining all PCA results together, you can use a MultiIndex.
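
As a quick illustration (a minimal sketch with made-up dates, separate from the full example below), MultiIndex.from_product pairs every date with every component label, which is exactly the shape needed to label the stacked weight rows:

import pandas as pd

# Two dates x two component labels -> four (date, label) index entries,
# one per PCA weight row.
idx = pd.MultiIndex.from_product(
    (pd.date_range("2018-04-24", periods=2), ["weights_1", "weights_2"]),
    names=["date", "pca_component"],
)
print(idx.to_list())
# [(Timestamp('2018-04-24 00:00:00'), 'weights_1'),
#  (Timestamp('2018-04-24 00:00:00'), 'weights_2'),
#  (Timestamp('2018-04-25 00:00:00'), 'weights_1'),
#  (Timestamp('2018-04-25 00:00:00'), 'weights_2')]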

Here is a full working example. You can configure the window length, number of dates, number of data columns, and the number of returned PCA components by setting window_length, n_dates, n_columns, and n_pca_components, respectively. Note that n_pca_components must be less than or equal to n_columns.

import numpy as np                                                                                               
import pandas as pd
from sklearn.decomposition import PCA
rng = np.random.default_rng(42)

values_for_df = []
n_dates = 60
window_length = 2
n_columns = 5
n_pca_components = 5
n_windows = n_dates - window_length + 1

for i in range(n_columns):
    random_numbers = rng.random(size=n_dates)
    values_for_df.append(random_numbers)

df = pd.DataFrame(values_for_df).T
dates = pd.date_range(start="2018-04-24", periods=n_windows)
pca_component_labels = [f"weights_{i+1}" for i in range(window_length)]
my_index = pd.MultiIndex.from_product(
    (dates, pca_component_labels),
    names=["date", "pca_component"]
)
weights = []
for i in range(n_windows):
    pca = PCA()
    transf = pca.fit_transform(df.iloc[i:i+window_length])
    weights.append(pca.components_[:, :n_pca_components])

timeseries_pca = pd.DataFrame(
    np.concatenate(weights),
    index=my_index,
    columns=[f"PCA{i+1}" for i in range(n_pca_components)]
)
timeseries_pca

This is the result.

                              PCA1      PCA2      PCA3      PCA4      PCA5
date       pca_component
2018-04-24 weights_1     -0.846738 -0.498592  0.166146  0.006707 -0.082409
           weights_2     -0.530398  0.817748 -0.206418 -0.009032  0.085296
2018-04-25 weights_1      0.427636  0.095916 -0.576067 -0.381654 -0.574818
           weights_2      0.875506 -0.012782  0.117490  0.117209  0.453634
2018-04-26 weights_1      0.291577 -0.361262 -0.599255  0.366988 -0.539153
...                            ...       ...       ...       ...       ...
2018-06-19 weights_2      0.468190  0.773792 -0.259101  0.301738 -0.154483
2018-06-20 weights_1      0.461097  0.308911  0.108246  0.824414  0.024239
           weights_2      0.172187  0.429880  0.585669 -0.317087 -0.584809
2018-06-21 weights_1     -0.085999  0.664442 -0.486954  0.310776 -0.466278
           weights_2     -0.128612  0.040888  0.719771 -0.016595 -0.680765

[118 rows x 5 columns]
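
If it helps, the weights for a single window can then be pulled back out by indexing the first level of the MultiIndex (a small usage sketch on the timeseries_pca frame built above):

# The 2 x 5 block of component weights for the window starting 2018-04-24,
# returned as a sub-frame indexed by pca_component.
print(timeseries_pca.loc[pd.Timestamp("2018-04-24")])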
