
Answering business questions with machine learning models (scikit or statsmodels)

Thanks for your help on this.

This feels like a silly question, and I may be overcomplicating things. Some background: I recently learned some machine learning methods in Python (scikit-learn and some statsmodels), such as linear regression, logistic regression, KNN, etc. I can prep the data in pandas data frames and transform categorical data to 0's and 1's. I can also load the data into a model (for example, logistic regression in scikit-learn). I know how to train and test it (using CV, etc.), and some tuning methods (grid search, etc.). But this is all in the scope of predicting outcomes on new data. I mainly focused on learning to build a model that predicts on new X values, and on testing that model to confirm accuracy/precision.

However, now I'm having trouble identifying and executing the steps for the OTHER kinds of questions that, say, a regression model can answer, like:

Why did customer service calls drop last month? Should we go with this promotion model or another one?

Assuming we have all our variables/predictor sets, how would we answer those two questions using any supervised machine learning model, or just a statistical model from the statsmodels package?

Hope this makes sense. I can certainly go into more detail.

Why did customer service calls drop last month?

It depends on the type of data and the features you have available to analyze and explore. One basic step is to look at the correlation between each feature and the target variable, to check whether any feature moves together with the drop in calls. Exploring descriptive statistics like this may answer the question better than a prediction model would.
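As a rough illustration of the correlation check, here is a minimal sketch with pandas. The data is synthetic and every column name (`units_sold`, `defect_rate`, `marketing_spend`, `service_calls`) is a made-up assumption, not something from the question:

```python
import numpy as np
import pandas as pd

# Synthetic monthly data; all column names and numbers are hypothetical.
rng = np.random.default_rng(0)
n = 24
units_sold = rng.normal(1000, 100, n)
defect_rate = rng.uniform(0.01, 0.05, n)
marketing_spend = rng.normal(50, 10, n)
# By construction, calls depend on units sold and defect rate only.
service_calls = 0.5 * units_sold + 2000 * defect_rate + rng.normal(0, 5, n)

df = pd.DataFrame({
    "units_sold": units_sold,
    "defect_rate": defect_rate,
    "marketing_spend": marketing_spend,
    "service_calls": service_calls,
})

# Pearson correlation of each feature with the target, strongest first.
corr_with_calls = df.corr()["service_calls"].drop("service_calls")
print(corr_with_calls.sort_values(key=abs, ascending=False))
```

A strong correlation only flags a feature worth a closer look; it does not show that the feature caused the drop.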

Also, it is always good practice to analyze and explore the data before you start working on prediction models, as it's often necessary to improve the data (scaling, removing outliers, handling missing data, etc.) depending on the prediction model you choose.
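The clean-up steps mentioned here could look something like the sketch below. The choices in it (median imputation, clipping at 1.5 times the IQR, standard scaling) are just one common combination I am assuming for illustration, and the columns are made up:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny hypothetical data set with missing values and one obvious outlier.
df = pd.DataFrame({
    "units_sold": [950.0, 1020.0, np.nan, 980.0, 5000.0, 1010.0],
    "defect_rate": [0.02, 0.03, 0.025, np.nan, 0.02, 0.04],
})

# Missing data: fill each column with its median.
filled = df.fillna(df.median())

# Outliers: clip each column to [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
iqr = q3 - q1
clipped = filled.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)

# Scaling: zero mean and unit variance per column.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(clipped), columns=clipped.columns
)
print(scaled.round(2))
```

Whether you need each step depends on the model: distance-based models such as KNN are sensitive to scale, while tree-based models mostly are not.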

Should we go with this promotion model or another one?

This question can be answered with the regression or other prediction models you built for this data. Given the input features that describe each promotion model, these models can predict the expected sales/outcome for each one, and you can compare the predictions.
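A minimal sketch of that comparison, assuming a fitted scikit-learn regression and two hypothetical promotions described by a discount percentage and an ad spend (both features and all numbers are invented for the example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic history of past promotions; feature names are hypothetical.
rng = np.random.default_rng(1)
n = 200
discount_pct = rng.uniform(0, 30, n)
ad_spend = rng.uniform(0, 10, n)
sales = 100 + 2.0 * discount_pct + 5.0 * ad_spend + rng.normal(0, 3, n)

X = np.column_stack([discount_pct, ad_spend])
model = LinearRegression().fit(X, sales)

# Two candidate promotions, each as [discount_pct, ad_spend].
promo_a = [25.0, 2.0]
promo_b = [5.0, 8.0]
predicted = model.predict(np.array([promo_a, promo_b]))

print(f"Promotion A predicted sales: {predicted[0]:.1f}")
print(f"Promotion B predicted sales: {predicted[1]:.1f}")
```

This compares predicted sales only. A real decision would also weigh the cost of each promotion and the uncertainty of the predictions.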

Your question could be seen as too broad, since what you're asking is, in effect, a version of "What should I be modeling?" That said, I will try to offer some thoughts about the question you raise, in case it proves helpful.

Take your first hypothetical as a sample: "Why did customer service calls drop last month?"

First, this assumes that you have a phenomenon that you want to understand (lower customer service calls). In developing any model, you should make sure the question you pose could, in theory, be answered by the model. In this case, the phrasing could be: What factors, for which we have good information, led to a decrease in customer service calls last month (as compared to some previous time period)?

The phrasing is stilted, but points out the issue: the model is meant as a tool to quantify potential answers to your issue.

What you need, at this point, is to understand why you may include, or exclude, information from the model. Theory is the best guide, even a loose one. Customer calls are a function of what? Number of units sold? Production quality? Clarity of instructions provided with the unit? Also, some of these are functions of other issues: number of units is a function of time of year, marketing, general sales trends, etc.

Let's assume you have identified, and can capture, the features you think are relevant to the outcome of interest: customer service calls. Further, assume you have stored them, cleaned them, processed them, and have a data set ready and waiting.

As stated, you are looking to explain a result you have already seen (the drop in calls). You have innumerable options for models; the choice of type/style depends entirely on what you want to know. The way you pose the question, it seems that you might be interested in causal relationships. This is hard to do, since there are always variables you can't capture that may affect what you did capture (confounders), but it isn't impossible. Regression models (linear, logistic, maximum likelihood in general, GLM, 2SLS, and so on) are often good at this, without needing the usual train/test steps present in much of ML. (Though, as someone I read somewhere said, there is no explanation without prediction. Reference, anyone?) The coefficients you get from these kinds of models can tell you things like which features correlate with increases/decreases in service calls. (I refrain from saying "cause", since that requires some very specific conditions.) This might be a good starting point for you.
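To make the coefficient-reading idea concrete, here is a sketch using the statsmodels formula API on the full data set, with no train/test split. The data is synthetic and the columns (`units_sold`, `defect_rate`, `new_manual`) are assumptions I made up for the example:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data; every column name and effect size is hypothetical.
rng = np.random.default_rng(2)
n = 120
df = pd.DataFrame({
    "units_sold": rng.normal(1000, 100, n),
    "defect_rate": rng.uniform(0.01, 0.05, n),
    "new_manual": rng.integers(0, 2, n),  # 1 if a clearer manual shipped
})
df["service_calls"] = (
    0.2 * df["units_sold"]
    + 2000 * df["defect_rate"]
    - 15 * df["new_manual"]
    + rng.normal(0, 10, n)
)

# Ordinary least squares fitted on all rows; the goal is explanation.
result = smf.ols(
    "service_calls ~ units_sold + defect_rate + new_manual", data=df
).fit()

print(result.params)      # sign and size of each association
print(result.pvalues)     # how compatible each one is with zero effect
print(result.conf_int())  # 95% confidence intervals
```

A negative coefficient on `new_manual` would say that, holding the other features fixed, periods with the new manual are associated with fewer calls. As above, that is an association, not proof of cause.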

Or, you might simply be interested in asking "of those features I have captured, which is the most predictive of the service call volume", in which case you have a much more straightforward predictive-model case, where you're simply looking for a really good predictive model. Of course, these are not mutually exclusive. If something is causal, it's often going to be important in a predictive model (the causal effect could be small, of course).
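For the "which feature is most predictive" version of the question, one possible approach (my assumption, there are others) is permutation importance on held-out data with scikit-learn. The data and feature names below are synthetic and hypothetical:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data; feature names are hypothetical.
rng = np.random.default_rng(3)
n = 300
feature_names = ["units_sold", "defect_rate", "unrelated_noise"]
X = np.column_stack([
    rng.normal(1000, 100, n),
    rng.uniform(0.01, 0.05, n),
    rng.normal(0, 1, n),
])
y = 0.5 * X[:, 0] + 1000 * X[:, 1] + rng.normal(0, 5, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Shuffle one feature at a time and measure the drop in test-set R^2.
importance = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
order = np.argsort(importance.importances_mean)[::-1]
ranked = [feature_names[i] for i in order]
for i in order:
    print(f"{feature_names[i]}: {importance.importances_mean[i]:.3f}")
```

Here the train/test split matters again, because the question is about predictive value rather than explanation.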

Ultimately, you should familiarize yourself with how to interpret the coefficients and other results that come out of a model, and with what they indicate about the relationship with the response variable of interest. That will give you a decent idea of what each model can say about the phenomenon of interest.


 