
What if a categorical column has multiple values in the train set but only one in test data? Would such a feature be useful in model training at all?

I am trying to solve a regression problem in which one of my features takes two values ('1', '0') in the train set but only the value '1' in the test data. Intuitively, including this feature seems wrong to me, but I cannot find a concrete argument to support that intuition.

Well, it depends on how many features you have in total. If there are very few (say, fewer than five), that single feature will most likely play an important role in your predictions. In that case, I would say you have a "data mismatch" problem, meaning that your training and test data come from different distributions. One simple way to solve it is to put the two sets together, shuffle the whole set, and split your data again.
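The pool, shuffle, and re-split idea can be sketched roughly as below. This is only an illustration under assumptions, not code from the original post: the names `flag`, `y`, `value_counts`, and `pool_and_resplit` are hypothetical, the toy data is made up, and the approach only makes sense if the test rows come with target values so they can be used for training.

```python
import random
from collections import Counter


def value_counts(rows, feature):
    """Count how often each value of `feature` occurs in a list of dict rows."""
    return Counter(row[feature] for row in rows)


def pool_and_resplit(train, test, test_fraction=0.2, seed=0):
    """Pool both sets, shuffle them together, and split again.

    Assumption: rows in `test` carry targets, otherwise they cannot be
    moved into the new training set.
    """
    pooled = list(train) + list(test)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(pooled)
    n_test = int(round(len(pooled) * test_fraction))
    return pooled[n_test:], pooled[:n_test]


# Hypothetical toy data: `flag` is the binary feature, `y` the regression target.
train = [{"flag": i % 2, "y": float(i)} for i in range(80)]  # flag is 0 or 1
test = [{"flag": 1, "y": float(i)} for i in range(20)]       # flag is always 1

# Before: the test set contains a single value of `flag`.
print("before:", value_counts(train, "flag"), value_counts(test, "flag"))

new_train, new_test = pool_and_resplit(train, test)

# After: both values should normally appear in both splits.
print("after: ", value_counts(new_train, "flag"), value_counts(new_test, "flag"))
```

Comparing the value counts before and after the split is a quick way to see whether the mismatch in that feature has gone away. If the test set is a fixed, unlabeled hold-out, this re-split is not available and the mismatch has to be handled another way.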

