
Correlation of columns across time series

I'm trying to understand the correlation between sales activity and closed orders.

So, for example, sales activities in January lead to a certain number of opportunities in February, which lead to a certain number of orders being won in March. The difficulty I'm having is that there is not always a one-month lag between activity, opportunity, and won order. It seems to me that pandas .corr() wants already-aligned data sets, but that alignment is one of my unknowns and one of the things I am trying to understand. The other difficulty is the scales: calls are measured in number of calls, while opportunities and won orders are measured in dollars. So my question is this: is there a way to best-fit data from different columns so that I can apply a correlation?

import pandas as pd

d = {
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May'],
    'Year': [2019, 2019, 2019, 2019, 2019],
    'CallsActivity': [10, 20, 30, 40, 50],
    'NewOpportunitiesRevenue': [0, 5000, 10000, 15000, 20000],
    'WonOpportunitiesRevenue': [0, 0, 1000, 2000, 3000]
}
df = pd.DataFrame(data=d)

I would want this to show up as something like the following:

import numpy as np

correlation_d = {
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May'],
    'Year': [2019, 2019, 2019, 2019, 2019],
    'CallsActivity': [10, 20, 30, 40, 50],
    'NewOpportunitiesRevenue': [5000, 10000, 15000, 20000, np.nan],
    'WonOpportunitiesRevenue': [1000, 2000, 3000, np.nan, np.nan]
}
correlation_df = pd.DataFrame(data=correlation_d)

print(correlation_df)

I can get the correlation to work if I manually move the columns around in this simple example (for instance, the call shown below), but I don't know where to begin on automating that part of the study for my actual dataset. I'd appreciate any insight into this.
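For reference, this is what the manual version looks like on the shifted frame, using the correlation_df just defined (a minimal illustration; DataFrame.corr() drops the NaN rows pairwise):

# Pairwise Pearson correlation of the manually shifted columns;
# rows containing NaN are ignored pairwise by DataFrame.corr().
print(correlation_df[['CallsActivity',
                      'NewOpportunitiesRevenue',
                      'WonOpportunitiesRevenue']].corr())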

Thanks.

If I interpret your question to mean "how do I choose the ideal lag amount for each column automatically?", then what you could do is write a loop that:

1) Calculates the correlation between two columns.
2) Compares that correlation to the largest correlation seen so far; if the new correlation is greater, updates the maximum to the new value and records the row shift (lag) that produced it, otherwise keeps the maximum as it is.
3) Shifts one of the two columns up/down by X rows.
4) Goes back to the top of the loop.

The loop should keep going until you cannot shift the column up/down any further and you have explored all reasonable lags. You will then have the maximum correlation observed and the shift amount (lag) that produces it. If speed matters, start with a large step X so the search runs quickly, then make X smaller and smaller to trade speed for accuracy. A minimal sketch of this loop follows below.
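Here is one way that search could look in pandas. This is only a sketch using the question's column names; the helper name best_lag is mine, and max_lag (how far to search) is an assumption you would set from domain knowledge:

import pandas as pd

def best_lag(df, driver_col, outcome_col, max_lag=12):
    """Hypothetical helper: return the lag (in rows) that maximizes the
    Pearson correlation between driver_col and outcome_col."""
    best_corr, best_shift = float('-inf'), 0
    for lag in range(max_lag + 1):
        # Shift the outcome back by `lag` rows so that month m of the
        # driver lines up with month m + lag of the outcome.
        corr = df[driver_col].corr(df[outcome_col].shift(-lag))
        if pd.notna(corr) and corr > best_corr:
            best_corr, best_shift = corr, lag
    return best_shift, best_corr

df = pd.DataFrame({
    'CallsActivity': [10, 20, 30, 40, 50],
    'NewOpportunitiesRevenue': [0, 5000, 10000, 15000, 20000],
    'WonOpportunitiesRevenue': [0, 0, 1000, 2000, 3000],
})
# Note: this toy data is almost perfectly linear, so several lags can tie
# at correlation 1.0; the helper keeps the first (smallest) lag it finds.
print(best_lag(df, 'CallsActivity', 'NewOpportunitiesRevenue', max_lag=3))
print(best_lag(df, 'CallsActivity', 'WonOpportunitiesRevenue', max_lag=3))

Note that Pearson correlation is scale-invariant, so you can correlate call counts with dollar amounts directly; no rescaling is needed.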

I believe this 14-minute video might also help you. It will teach you how to do rolling statistics and rolling functions, so that you can automate creating new rows based on a function of existing rows: Rolling statistics - p.11 Data Analysis with Python and Pandas Tutorial

However, I am not sure whether you are having trouble with shifting the columns up and down by a given lag amount automatically for all columns, or with deciding the ideal lag amount to begin with for each column. I would ask you this via a comment, except I don't have enough reputation points to do so just yet...

Edit: You can also compute the correlation over a "rolling window" (a subsample that keeps being moved along the data). In older pandas this was pandas.rolling_corr(); in current versions it is written as Series.rolling(window).corr(other). But I believe you would still need to shift the data yourself in a loop to find the best lag. To shift the data you can use slice notation like df['1st Column name'][Shift_variable:], or, more idiomatically, Series.shift().
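For illustration, here is a minimal sketch of the rolling version under the same assumed column names, using the current spelling of the API (the window size of 3 and the lag of 2 are assumptions for the example):

import pandas as pd

df = pd.DataFrame({
    'CallsActivity': [10, 20, 30, 40, 50],
    'WonOpportunitiesRevenue': [0, 0, 1000, 2000, 3000],
})

# Apply the chosen lag first, then correlate over a moving 3-row window.
calls = df['CallsActivity']
won = df['WonOpportunitiesRevenue'].shift(-2)

# In current pandas, rolling correlation is spelled
# Series.rolling(window).corr(other); pandas.rolling_corr() was removed.
print(calls.rolling(window=3).corr(won))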


 