
Making per-group predictions with train and test data, grouped by multiple columns

I am very new to machine learning and algorithms. I am still learning the various ML concepts, so please excuse my ignorance.

I am working on a project in which I need to predict the number of sales rep calls to be made in the next quarter, based on the rep calls in the historical data. I am providing a sample dataframe for reference and would appreciate suggestions.

The rep_calls prediction for QTR4 should be based on the rep calls (the REP_VISITS column) for each CUSTOMER_NUMBER and PRODUCT_ID (the PRODUCT column in the sample) that are available for the last 3 quarters.

import pandas as pd

df = pd.DataFrame({"CUSTOMER_NUMBER": ["CUST1", "CUST1", "CUST1", "CUST1", "CUST1", "CUST1", "CUST1", "CUST1", "CUST1", "CUST2", "CUST2", "CUST2", "CUST2", "CUST2", "CUST2", "CUST2", "CUST3", "CUST3", "CUST3", "CUST4", "CUST4", "CUST4"],
"PRODUCT": ["PRODUCT1", "PRODUCT2", "PRODUCT3", "PRODUCT1", "PRODUCT2", "PRODUCT3", "PRODUCT1", "PRODUCT2", "PRODUCT3", "PRODUCT1", "PRODUCT2", "PRODUCT3", "PRODUCT1", "PRODUCT2", "PRODUCT3", "PRODUCT3", "PRODUCT3", "PRODUCT3", "PRODUCT3", "PRODUCT1", "PRODUCT1", "PRODUCT2"],
"REP_VISITS": ["3", "3", "3", "3", "3", "3", "4", "4", "4", "3", "2", "2", "4", "6", "8", "5", "3", "1", "3", "2", "0", "3"],
"QTR": ["QTR1", "QTR1", "QTR1", "QTR2", "QTR2", "QTR2", "QTR3", "QTR3", "QTR3", "QTR1", "QTR1", "QTR1", "QTR2", "QTR2", "QTR2", "QTR3", "QTR1", "QTR2", "QTR3", "QTR1", "QTR2", "QTR3"],
"START_DATE": ["2020-01-01", "2020-01-01", "2020-01-01", "2020-04-01", "2020-04-01", "2020-04-01", "2020-07-01", "2020-07-01", "2020-07-01", "2020-01-01", "2020-01-01", "2020-01-01", "2020-04-01",  "2020-04-01", "2020-04-01","2020-07-01", "2020-01-01", "2020-04-01", "2020-07-01", "2020-01-01", "2020-04-01", "2020-07-01"],
"END_DATE": ["2020-03-31", "2020-03-31", "2020-03-31", "2020-06-30", "2020-06-30", "2020-06-30", "2020-09-30", "2020-09-30", "2020-09-30", "2020-03-31", "2020-03-31", "2020-03-31", "2020-06-30", "2020-06-30", "2020-06-30", "2020-09-30", "2020-03-31", "2020-06-30", "2020-09-30", "2020-03-31", "2020-06-30", "2020-09-30"]})

The dataframe looks as below:

[Image: the dataframe shown as a table]

I need to find out the predicted rep_calls for QTR4.

CUST1|PRODUCT1||QTR4|
CUST1|PRODUCT2||QTR4|
CUST1|PRODUCT3||QTR4|
CUST2|PRODUCT1||QTR4|
CUST2|PRODUCT2||QTR4|
CUST2|PRODUCT3||QTR4|
CUST3|PRODUCT3||QTR4|
CUST4|PRODUCT1||QTR4|
CUST4|PRODUCT2||QTR4|
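
As a point of reference, the empty QTR4 column above could first be filled with a per-pair historical average, before any model is trained. The sketch below is only an assumption about what a reasonable baseline might look like. It uses a shortened copy of the sample data, and the names `history`, `baseline`, and `PREDICTED_REP_VISITS` are made up for illustration.

```python
import pandas as pd

# Shortened copy of the sample data (CUST3 and CUST4 only), for illustration
history = pd.DataFrame({
    "CUSTOMER_NUMBER": ["CUST3", "CUST3", "CUST3", "CUST4", "CUST4", "CUST4"],
    "PRODUCT": ["PRODUCT3", "PRODUCT3", "PRODUCT3", "PRODUCT1", "PRODUCT1", "PRODUCT2"],
    "REP_VISITS": ["3", "1", "3", "2", "0", "3"],
    "QTR": ["QTR1", "QTR2", "QTR3", "QTR1", "QTR2", "QTR3"],
})

# REP_VISITS is stored as strings in the sample, so convert it before averaging
history["REP_VISITS"] = history["REP_VISITS"].astype(int)

# Group by both key columns and average the visits over the available quarters
baseline = (
    history.groupby(["CUSTOMER_NUMBER", "PRODUCT"], as_index=False)["REP_VISITS"]
    .mean()
    .rename(columns={"REP_VISITS": "PREDICTED_REP_VISITS"})
)
baseline["QTR"] = "QTR4"
print(baseline)
```

Any trained model can then be compared against this average. If it does not do better, the extra complexity is probably not worth it on data this small.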

Please guide me on how I can create a training dataset for the customers/products, with appropriate predictions, so that I can use the test data for predictions/evaluation.

I think you could try using the customer number and product ID as features and train a simple classifier such as logistic regression or a decision tree. You could one-hot encode the different customer numbers and product IDs. With this approach, REP_VISITS would be the label and the features would be CUST1, CUST2, CUST3, PRODUCT1, PRODUCT2, etc. scikit-learn has implementations of these algorithms that are easy to use. Hope this helps:

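import pandas as pd
from sklearn.tree import DecisionTreeClassifier

unique_cust_nos = df['CUSTOMER_NUMBER'].unique()
unique_products = df['PRODUCT'].unique()
# Feature columns: ['CUST1', 'CUST2', 'CUST3', 'CUST4', 'PRODUCT1', 'PRODUCT2', 'PRODUCT3']
feature_cols = list(unique_cust_nos) + list(unique_products)
label = 'REP_VISITS'
# Make a dataframe (all_features_df) with one one-hot column per customer and product, plus rep visits as the label
all_features_df = pd.get_dummies(df[['CUSTOMER_NUMBER', 'PRODUCT']], prefix='', prefix_sep='')
all_features_df[label] = df[label].astype(int)  # REP_VISITS is stored as strings in the sample
X = all_features_df[feature_cols]  # Features
y = all_features_df[label]  # Target variable
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()
# Train Decision Tree classifier
clf = clf.fit(X, y)
# Test set: one row per customer/product pair, encoded with the same feature columns
pairs = df[['CUSTOMER_NUMBER', 'PRODUCT']].drop_duplicates()
X_test = pd.get_dummies(pairs, prefix='', prefix_sep='')[feature_cols]
# Predict the response for the test dataset
y_pred = clf.predict(X_test)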
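
Because REP_VISITS is a count, a classifier treats each distinct count as an unrelated class. A regressor may be a more natural fit. The following is a minimal sketch of that variation, not part of the approach described above. It assumes a shortened copy of the sample data, and the names `train`, `pairs`, and `PREDICTED_REP_VISITS` are hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Shortened copy of the sample data (CUST3 and CUST4 only), for illustration
train = pd.DataFrame({
    "CUSTOMER_NUMBER": ["CUST3", "CUST3", "CUST3", "CUST4", "CUST4", "CUST4"],
    "PRODUCT": ["PRODUCT3", "PRODUCT3", "PRODUCT3", "PRODUCT1", "PRODUCT1", "PRODUCT2"],
    "REP_VISITS": [3, 1, 3, 2, 0, 3],
})

# One-hot encode the two key columns; the column names come from the data itself
X = pd.get_dummies(train[["CUSTOMER_NUMBER", "PRODUCT"]])
y = train["REP_VISITS"]

# Fit a regression tree so that predictions are numeric rather than class labels
reg = DecisionTreeRegressor(random_state=0).fit(X, y)

# QTR4 rows: one per customer/product pair seen in the history
pairs = train[["CUSTOMER_NUMBER", "PRODUCT"]].drop_duplicates().reset_index(drop=True)
# Reindex so the test columns match the training columns exactly
X_test = pd.get_dummies(pairs).reindex(columns=X.columns, fill_value=0)
pairs["PREDICTED_REP_VISITS"] = reg.predict(X_test)
print(pairs)
```

With only the customer and product as features, the model has no information about the quarter, so it cannot pick up a trend over time. Adding the previous quarters' visits as extra feature columns would be one way to address that.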
