
Machine Learning - How to predict set of fixed fields based on past features

I have a fairly large dataset (> 100k rows) containing information about logistics shipments (export shipments).

The dataset looks like this:

|shipper|consignee                    |origin|destination                                  |
|-------|-----------------------------|------|---------------------------------------------|
|6409292|288882                       |USSFO |CNPVG                                        |
|6409292|288882                       |USSFO |CNPVG                                        |
|6409292|182724                       |USSFO |HKHKG                                        |
|6409292|182724                       |USSFO |HKHKG                                        |
|8201922|948292                       |USSFO |FRCDG                                        |
|8201922|948292                       |USSFO |FRCDG                                        |
|8201922|948292                       |USSFO |FRNIC                                        |
|8201922|291222                       |USEWR |AEDXB                                        |

So what we have here is a list of past shipments. It shows the relationship between shipper and consignee, where each shipment originated, and where it was sent.

Based on this past data, I want to predict the destination of a new shipment when it is added, by looking at the consignee code and origin.

Example

Take below new booking as an example:

|shipper|consignee                    |origin|destination                                  |
|-------|-----------------------------|------|---------------------------------------------|
|1234567|948292                       |USMOB |?                                            |

How can I train a model to predict the destination? And what is this area of ML called?

Before diving into machine learning, it is important to understand a few concepts:

  • Dataset: Your collection of data, made up of feature columns and a target column which we want to predict.

  • Problem type: The kind of problem we are facing. Predicting a destination code from a fixed set of codes is a classification problem. Please check the following link, which explains more about it: problem types.

  • Metric: This is how we evaluate the performance of our model, and you have to choose one that fits the problem. For example, if the target is True or False in equal proportions, a model that always answers True gets 50% of the answers right. That is a model with 0.5 accuracy, yet it has learned nothing, because it only ever answers True. I hope that this post helps you understand this better.

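  • Cross validation: Instead of relying on a single split, the model is trained and evaluated on several different splits of the data and the scores are averaged. See: crossvalidation sklearn.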

  • Train and test splits: We split our dataset into pieces, using one part of the data to train the model and the other part to test or evaluate it.
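The point about metrics can be checked with a small sketch. The labels below are made up for illustration, and `always_true_scores` is a hypothetical helper, not part of sklearn. It scores a baseline that always answers 1 (True), once on a 50/50 split and once on a 90/10 split, using plain accuracy and balanced accuracy.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score


def always_true_scores(y_true):
    """Score a baseline that answers 1 (True) for every row."""
    # Placeholder features; a constant baseline ignores them.
    X = np.zeros((len(y_true), 1))
    baseline = DummyClassifier(strategy="constant", constant=1)
    baseline.fit(X, y_true)
    y_pred = baseline.predict(X)
    return (accuracy_score(y_true, y_pred),
            balanced_accuracy_score(y_true, y_pred))


# Made-up labels: a 50/50 split and a 90/10 split.
balanced_labels = np.array([1, 0] * 50)
skewed_labels = np.array([1] * 90 + [0] * 10)

acc_balanced, bal_acc_balanced = always_true_scores(balanced_labels)
acc_skewed, bal_acc_skewed = always_true_scores(skewed_labels)

print("50/50 split:", acc_balanced, bal_acc_balanced)
print("90/10 split:", acc_skewed, bal_acc_skewed)
```

On the skewed labels, plain accuracy looks good even though the baseline never answers 0, while balanced accuracy stays at 0.5. Comparing any real model against a baseline like this is a cheap sanity check on the chosen metric.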

Most of this can be done with the popular library sklearn, as in the following example:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

dataset = load_iris()

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.3, random_state=0)

# No cross validation: fit once, score on the held-out test set
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))

# With cross validation: average the accuracy over 5 folds of the training data
model = RandomForestClassifier(random_state=0)
scores = cross_val_score(model, X_train, y_train, scoring="accuracy", cv=5)
print(scores.mean())

This example is deliberately simple: the data is already clean and numeric, and the problem is easy enough that most runs reach an accuracy of about 0.9. Your problem will need more work, because columns such as consignee and origin are categorical codes that have to be encoded before a model can use them. My suggestion is to browse Kaggle for notebooks or kernels in which people process a dataset and build a baseline for a given problem; along the way you will pick up new topics such as one-hot encoding, feature extraction, and many more.
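For the shipment data in the question, a minimal sketch of that idea could look like the following. It treats the task as multi-class classification, one-hot encodes the consignee and origin codes, and trains a random forest on the eight sample rows from the question. The choice of model, the variable names, and the use of `handle_unknown="ignore"` (so that an origin never seen in training, like USMOB, is encoded as all zeros instead of raising an error) are assumptions made for illustration, not a recommendation for the full 100k-row dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Sample rows from the question: (consignee, origin) -> destination.
features = [
    ["288882", "USSFO"],
    ["288882", "USSFO"],
    ["182724", "USSFO"],
    ["182724", "USSFO"],
    ["948292", "USSFO"],
    ["948292", "USSFO"],
    ["948292", "USSFO"],
    ["291222", "USEWR"],
]
destinations = ["CNPVG", "CNPVG", "HKHKG", "HKHKG",
                "FRCDG", "FRCDG", "FRNIC", "AEDXB"]

model = Pipeline([
    # Turn each categorical code into 0/1 indicator columns.
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("classify", RandomForestClassifier(n_estimators=100, random_state=0)),
])
model.fit(features, destinations)

# The new booking from the question: known consignee, unseen origin.
new_booking = [["948292", "USMOB"]]
predicted_destination = model.predict(new_booking)[0]

# Per-destination probabilities, useful for showing a ranked shortlist.
classes = model.named_steps["classify"].classes_
probabilities = dict(zip(classes, model.predict_proba(new_booking)[0]))

print(predicted_destination)
print(sorted(probabilities.items(), key=lambda item: -item[1]))
```

With real data you would hold out a test set and cross-validate as in the iris example above. Showing the top few destinations by probability may be more useful in a booking form than a single prediction.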

There are also libraries that automate much of this for you and can solve a classification problem; check out MLBlocks or ATM.


 