
Large (over 1 million classes) multi-class classifier via Keras

I have data for around 2 million active customers and around 2-5 years' worth of transaction data per customer. The data includes features such as which item the customer bought, which store they bought it from, the purchase date, how much they bought, how much they paid, etc.

I need to predict which of our customers will shop in the next 2 weeks.

Right now my data is set up like this

item_a  item_b  item_c  item_d  customer_id  visit
dates                                             
6/01       1      0      0      0  cust_123      1
6/02       0      0      0      0  cust_123      0
6/03       0      1      0      0  cust_123      1
6/04       0      0      0      0  cust_123      0
6/05       1      0      0      0  cust_123      1
6/06       0      0      0      0  cust_123      0
6/07       0      0      0      0  cust_123      0
6/08       1      0      0      0  cust_123      1
6/01       0      0      0      0  cust_456      0
6/02       0      0      0      0  cust_456      0
6/03       0      0      0      0  cust_456      0
6/04       0      0      0      0  cust_456      0
6/05       1      0      0      0  cust_456      1
6/06       0      0      0      0  cust_456      0
6/07       0      0      0      0  cust_456      0
6/08       0      0      0      0  cust_456      0
6/01       0      0      0      0  cust_789      0
6/02       0      0      0      0  cust_789      0
6/03       0      0      0      0  cust_789      0
6/04       0      0      0      0  cust_789      0
6/05       0      0      0      0  cust_789      0
6/06       0      0      0      0  cust_789      0
6/07       0      0      0      0  cust_789      0
6/08       0      1      1      0  cust_789      1

Should I make the target variable something like

import numpy as np

# Label each row with the customer's id on visit days, 'no_purchase' otherwise
df['target_variable'] = np.where(df['visit'] > 0, df['customer_id'], 'no_purchase')

Or should my visit feature be the target variable? If it's the latter, should I one-hot encode (OHE) all 2 million customers? If not, how should I set this up in Keras so that it classifies visits for all 2 million customers?

I think you should first understand your problem better. Modeling it correctly requires strong domain knowledge, and it can be framed in many different ways. Below are just some examples:


Regression problem: given a customer's purchase record containing only relative dates, predict the number of days until the next purchase. For example:

  • construct a sequence like [date2-date1, date3-date2, date4-date3, ...] from your data.
  • [6, 7, 5, 13, ...] means the customer tends to buy on a weekly or biweekly basis.
  • [24, 30, 33, ...] means the customer tends to buy on a monthly basis.
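
As a rough sketch of how such gap sequences could be derived from a table like the one in the question, the snippet below uses pandas. The toy DataFrame, the made-up year in the dates, and the helper name `purchase_gaps` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data in the question's layout: one row per (customer, day)
# with a 0/1 visit flag. The year is invented; the question only shows month/day.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-06-01", "2020-06-03", "2020-06-05",
                            "2020-06-08", "2020-06-05", "2020-06-08"]),
    "customer_id": ["cust_123"] * 4 + ["cust_789"] * 2,
    "visit": [1, 1, 1, 1, 0, 1],
})

def purchase_gaps(frame):
    """Return {customer_id: [days between consecutive visits]}."""
    # Keep only visit days and order them chronologically per customer.
    visits = frame[frame["visit"] > 0].sort_values(["customer_id", "date"])
    gaps = {}
    for cust, dates in visits.groupby("customer_id")["date"]:
        # diff() gives date2-date1, date3-date2, ...; the first entry is NaT.
        gaps[cust] = dates.diff().dropna().dt.days.astype(int).tolist()
    return gaps

gaps_by_customer = purchase_gaps(df)
```

A customer with a single visit ends up with an empty list, so such customers would need separate handling before training.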

If you frame the problem this way, all you need to do is predict the next number in a given sequence. You can easily build such training data as follows:

  1. randomly select a full sequence, say [a, b, c, d, e, f, ..., z]
  2. randomly select a position to predict, say x
  3. pick the K (say K=6) preceding values [r, s, t, u, v, w] as your network input, and x as your network target.
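
The three steps above can be sketched in plain Python. The function name `sample_training_pair` and the choice to return `None` for sequences that are too short are assumptions, not part of the original answer.

```python
import random

def sample_training_pair(gaps, k=6, rng=random):
    """Pick a random target position and return (k preceding gaps, target gap)."""
    if len(gaps) <= k:
        return None  # too short to supply k inputs plus one target
    # Index of the value to predict (x); it must have k values before it.
    pos = rng.randrange(k, len(gaps))
    return gaps[pos - k:pos], gaps[pos]

# Usage: draw one (input, target) pair from a single customer's gap sequence.
example_gaps = [6, 7, 5, 13, 7, 6, 8, 14, 7]
pair = sample_training_pair(example_gaps, k=6, rng=random.Random(0))
```

Calling this repeatedly over randomly chosen customers yields a stream of fixed-length inputs and scalar targets for the network.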

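Once this model has been trained, your ultimate task can be resolved by comparing the predicted gap with your prediction horizon: a predicted gap greater than 60 days (for a two-month window; use 14 for the two-week window in the question) means the customer is unlikely to shop within it.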
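
A minimal sketch of that final check follows. The function name, the `horizon_days` parameter, and in particular the `days_since_last_purchase` adjustment are assumptions added for illustration; the answer itself only mentions comparing the predicted number with a threshold.

```python
def likely_to_shop(predicted_gap_days, days_since_last_purchase=0, horizon_days=60):
    """Return True if the predicted next purchase falls inside the horizon.

    horizon_days defaults to 60 to match the threshold in the answer;
    pass 14 for a two-week window.
    """
    # Assumption: part of the predicted gap may already have elapsed.
    days_until_next = predicted_gap_days - days_since_last_purchase
    return days_until_next <= horizon_days

# Usage: a predicted weekly shopper versus a predicted quarterly shopper.
weekly = likely_to_shop(7)
quarterly = likely_to_shop(90)
```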


Classification problem: given a customer's purchase record for K months, predict how many purchases the customer will make in the next two months.

Again, you need to create training data from your raw data, but this time the target for a customer is how many items they purchased in months K+1 and K+2. You may organize the K-month input record in whatever way suits you.
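
One possible way to build such an example is sketched below. It assumes the raw transactions have already been aggregated into a per-customer list of monthly item counts; that input format and the function name `make_count_example` are hypothetical.

```python
import numpy as np

def make_count_example(monthly_counts, k, start=0):
    """Split monthly counts into K months of input and a two-month target.

    The target is the total number of items bought in months K+1 and K+2
    of the window that begins at index `start`.
    """
    window = monthly_counts[start:start + k + 2]
    if len(window) < k + 2:
        raise ValueError("need at least k + 2 months of history")
    features = np.asarray(window[:k])
    target = int(np.sum(window[k:k + 2]))
    return features, target

# Usage: eight months of counts, K=6 input months, last two months as target.
x, y = make_count_example([3, 0, 5, 2, 4, 1, 7, 6], k=6)
```

Here the raw monthly counts serve as the input, but any other summary of the K-month record could take their place.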

Please note that the number of items a customer purchases is a discrete number, but far below 1M. In fact, as in age estimation from face images, people often quantize the target into bins, e.g. 0-8, 9-16, 17-24, etc. You can do the same for your problem. Of course, you can also formulate this as a regression problem and directly predict the number of items.
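
The binning step could look like the following sketch, which uses `np.digitize` with the example bins from above. The extra catch-all class for counts of 25 or more is an assumption, since the answer leaves the remaining bins open.

```python
import numpy as np

# Lower edges of every bin after the first: 0-8 | 9-16 | 17-24 | 25+
bin_edges = [9, 17, 25]

def counts_to_classes(counts):
    """Map raw purchase counts to integer class labels 0..len(bin_edges)."""
    # digitize returns i such that bin_edges[i-1] <= count < bin_edges[i].
    return np.digitize(counts, bin_edges)

# Usage: counts on either side of each bin boundary.
labels = counts_to_classes([0, 8, 9, 16, 17, 24, 30])
```

The resulting integer labels give a classification target with a handful of classes instead of millions.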


Why do you need to know your problem better?

  • As you can see, you can come up with a number of problem formulations that all look reasonable at first glance, and it can be very difficult to say which one is best.

  • It is worth noting the dependence between a problem set-up and its hidden premises (you may not notice them until you think about the problem carefully). For example, the regression set-up that predicts the gap until the next purchase implies that the number of items a customer purchased does not matter. That assumption may or may not be fair for your problem.

  • You may come up with a much simpler but more effective solution if you know your problem well.

For most problems like yours, you don't have to use deep learning, or at least not as a first step. Classic approaches may work better.
