One-to-many time series correlation in Python with very high dimensions

My database contains 1 million unique terms that were typed into my website's search box.

It currently contains two columns: "search term" (the user request) and "volume" (the number of requests for the search term made in a given month). The database is partitioned into monthly tables for the last 10 years. The mean volume is 18 per month. A search term is absent from a month's partition if no user requested it that month.
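To make the layout concrete, here is a minimal sketch of how such monthly partitions could be combined into one row per term and one column per month. The table contents, the column names `search_term` and `volume`, and the month labels are made up for illustration; filling absent months with 0 assumes that a missing row really means zero requests.

```python
import pandas as pd

# Hypothetical stand-ins for three monthly partitions; the real tables
# would be read from the database instead.
monthly_tables = {
    "2024-01": pd.DataFrame({"search_term": ["shoes", "boots"], "volume": [20, 5]}),
    "2024-02": pd.DataFrame({"search_term": ["shoes"], "volume": [18]}),
    "2024-03": pd.DataFrame({"search_term": ["shoes", "boots"], "volume": [25, 9]}),
}

# Stack the partitions into one long table, tagging each row with its month.
long_df = pd.concat(
    [table.assign(month=month) for month, table in monthly_tables.items()],
    ignore_index=True,
)

# One row per term, one column per month. A term absent from a partition
# is assumed to have had no requests, so the gap is filled with 0.
wide_df = long_df.pivot_table(
    index="search_term", columns="month", values="volume",
    aggfunc="sum", fill_value=0,
)
print(wide_df)
```

With this shape, each term's history is a single row, which suits row-against-row comparisons.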

I want to be able to analyse any single term and quickly identify its top n most meaningful correlated terms using Python.

Because of the data's size, generating an entire correlation matrix would be wasteful in terms of memory and CPU.

What DataFrame structure and function would be best suited to this one-to-many comparison in Python? And would this function require any detrending to be carried out?
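As an illustration of why detrending comes up at all, the sketch below builds two synthetic series that share nothing but a linear upward trend. It is not taken from the real data: the series, the noise level, and the choice of first differences as the detrending step are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
months = np.arange(120)

# Two synthetic terms that share only a linear upward trend.
df = pd.DataFrame(
    [months + rng.normal(0, 5, 120), months + rng.normal(0, 5, 120)],
    index=["term_a", "term_b"],
)

# Correlation of the raw series is dominated by the common trend.
raw_corr = df.loc["term_a"].corr(df.loc["term_b"])

# First differences along the month axis remove a linear trend;
# the first column is dropped because it has no predecessor.
diffed = df.diff(axis=1).iloc[:, 1:]
diff_corr = diffed.loc["term_a"].corr(diffed.loc["term_b"])

print(f"raw: {raw_corr:.2f}, differenced: {diff_corr:.2f}")
```

The raw correlation comes out high even though the month-to-month movements are unrelated, while the differenced correlation does not. Whether this matters for the search data depends on how strongly overall site traffic has grown over the 10 years.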

You could build the full correlation matrix once a month, or, instead of the full matrix, compute only the correlations of a list of interesting terms against all terms (a few-to-all approach). That way you have the statistics saved on file.
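A minimal sketch of the few-to-all idea, under some assumptions: the function name `few_to_all_corr`, the demo data, and the file name in the last comment are hypothetical, and the input is assumed to contain no NaN (for example because missing months were filled with 0). It relies on the fact that the Pearson correlation of two series equals the dot product of their centred, unit-length versions.

```python
import numpy as np
import pandas as pd

def few_to_all_corr(df, terms):
    """Pearson correlation of each term in `terms` with every row of `df`.

    Assumes one row per term, one column per month, and no NaN values.
    A row with constant volume has zero variance and yields NaN.
    """
    values = df.to_numpy(dtype=float)
    # Centre each row and scale it to unit length.
    centered = values - values.mean(axis=1, keepdims=True)
    unit = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    picked = unit[df.index.get_indexer(terms)]
    # Dot products of centred unit vectors are Pearson correlations.
    return pd.DataFrame(picked @ unit.T, index=terms, columns=df.index)

# Small random demo frame: 50 terms, 120 months.
rng = np.random.default_rng(1)
demo = pd.DataFrame(
    rng.integers(10, 26, size=(50, 120)).astype(float),
    index=[f"t{i}" for i in range(50)],
)

few = few_to_all_corr(demo, ["t0", "t3"])
print(few.shape)
# few.to_csv("interesting_terms_corr.csv")  # hypothetical file name
```

The result has one row per interesting term and one column per term in the data, so its size grows with the length of the list rather than with the square of the number of terms.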

If you choose to compute the one-to-all correlation on demand, you can at least build a DataFrame that works as a cache by storing the result each time you calculate the correlations of one term.
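One possible shape for such a cache, as a rough sketch: the helper `cached_corr` and the demo data are hypothetical, and for simplicity the store is a plain dict of Series that is turned into a DataFrame only when it needs to be inspected or saved.

```python
import numpy as np
import pandas as pd

def cached_corr(df, term, cache):
    """Correlations of `term` with every other row of `df`, computed once."""
    if term not in cache:
        cache[term] = df.drop(term, axis=0).corrwith(df.loc[term], axis=1)
    return cache[term]

# Small random demo frame: 30 terms, 120 months.
rng = np.random.default_rng(2)
demo = pd.DataFrame(
    rng.integers(10, 26, size=(30, 120)).astype(float),
    index=[f"t{i}" for i in range(30)],
)

cache = {}
first = cached_corr(demo, "t0", cache)    # computed with corrwith
second = cached_corr(demo, "t0", cache)   # returned from the cache
cached_corr(demo, "t5", cache)

# One row per term analysed so far; a term's correlation with itself
# is not computed, so that cell is NaN.
cache_df = pd.DataFrame(cache).T
print(cache_df.shape)
```

The cache has to be cleared or rebuilt whenever a new month of data is added, since every stored correlation then becomes stale.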

To compute the correlation of one term with all the others, you can use DataFrame.corrwith:

Say you have the following df:

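import string

import numpy as np
import pandas as pd

# 25 * 20 * 20 = 10,000 three-letter dummy terms
terms_list = [''.join((a, b, c))
            for a in string.ascii_lowercase[:25]
            for b in string.ascii_lowercase[:20]
            for c in string.ascii_lowercase[:20]]
np.random.seed(1)
# One random volume (10 to 25, or NaN for a missing month) per term and month
df = pd.Series(
    np.random.choice(list(np.arange(10, 26)) + [np.nan], int(120e4)),
    index = pd.MultiIndex.from_product([terms_list, range(120)],
        names=['term', 'month'])
    )
# Drop the missing months, then pivot to one row per term, one column per month
df = df.dropna().unstack()
pivot_term = terms_list[0]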

print(df)

aaa    15.0  21.0  22.0  18.0  19.0  21.0  15.0  ...   NaN  15.0  23.0  11.0  20.0  10.0  17.0
aab    10.0  24.0  23.0  21.0  16.0  23.0  25.0  ...   NaN  15.0  12.0  11.0  21.0  15.0  19.0
aac    21.0  11.0  10.0  17.0  10.0  12.0  13.0  ...  10.0  10.0  25.0  14.0  20.0  22.0  15.0
aad     NaN  10.0  21.0  22.0  21.0  13.0  22.0  ...  11.0  17.0  12.0  14.0  15.0  17.0  22.0
aae    23.0  10.0  17.0  25.0  19.0  11.0  11.0  ...  10.0  25.0  18.0  16.0  10.0  16.0  11.0
...     ...   ...   ...   ...   ...   ...   ...  ...   ...   ...   ...   ...   ...   ...   ...
ytp    24.0  18.0  16.0  23.0   NaN  19.0  18.0  ...  20.0  15.0  21.0  11.0  14.0  18.0  19.0
ytq    22.0  11.0  17.0  24.0  12.0  20.0  17.0  ...  16.0   NaN  13.0  13.0  18.0  22.0  15.0
ytr    22.0  19.0  20.0  11.0  10.0  20.0  14.0  ...  24.0  21.0   NaN  19.0  10.0  24.0  22.0
yts    22.0   NaN  22.0  17.0  14.0  14.0  25.0  ...  14.0  22.0   NaN  23.0  14.0  25.0  10.0
ytt    17.0  16.0  15.0  21.0  11.0  19.0  16.0  ...  10.0  19.0  19.0  13.0  21.0  18.0  16.0

[10000 rows x 120 columns]

The code

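from time import time

t1 = time()
max_periods = 120
# Keep only the most recent max_periods months
df = df.iloc[:, -max_periods:]
# Correlate every other term, row by row, with the pivot term
corr = df.drop(pivot_term, axis=0).corrwith(df.loc[pivot_term], axis=1)
t1 = time() - t1
print(corr)
print(t1)  # elapsed time in seconds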

Output

term
aab    0.045972
aac    0.064941
aad   -0.057009
aae   -0.187645
aaf   -0.075473
         ...
ytp    0.103756
ytq   -0.054769
ytr   -0.115004
yts    0.123223
ytt    0.230628
Length: 9999, dtype: float64
9.76

From here you can pick out the interesting terms with corr.nlargest or corr.nsmallest.
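For example, with a small made-up Series standing in for `corr` (the values below are invented, not results from the run above):

```python
import pandas as pd

# Hypothetical correlation values standing in for `corr`.
corr = pd.Series(
    {"aab": 0.05, "aac": 0.81, "aad": -0.77, "aae": 0.64, "aaf": -0.12}
)

top_positive = corr.nlargest(2)    # strongest positive correlations
top_negative = corr.nsmallest(1)   # strongest negative correlation

# Strongest in either direction: rank by absolute value,
# then look up the signed values again.
top_abs = corr.loc[corr.abs().nlargest(3).index]
print(top_abs)
```

Ranking by absolute value is worth considering if strongly negative correlations count as meaningful too.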

PS

You might also want to look into a smaller data type that still fits the maximum volume per month, say np.int16.
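A quick sketch of what the smaller types save, on made-up data of 1,000 terms by 120 months. Two assumptions to note: integer types such as np.int16 cannot hold NaN, so missing months have to be filled first (here with 0), and no monthly volume may exceed 32,767. np.float32 is shown as an in-between option that keeps NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data: 1,000 terms, 120 months, stored as float64.
rng = np.random.default_rng(3)
volumes = pd.DataFrame(rng.integers(10, 26, size=(1000, 120)).astype(float))

# float64: 8 bytes per value.
bytes_float64 = volumes.to_numpy().nbytes
# float32: 4 bytes per value, still able to represent NaN.
bytes_float32 = volumes.astype(np.float32).to_numpy().nbytes
# int16: 2 bytes per value, but gaps must be filled before converting.
bytes_int16 = volumes.fillna(0).astype(np.int16).to_numpy().nbytes

print(bytes_float64, bytes_float32, bytes_int16)
```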
