
Training spaCy TextCategorizer with data that belongs to no label?

I'm gathering training data for multilabel classification. Some of the data fed into this project will not have enough information to assign it to one of the labels. If I train the model with data that belongs to no label, will it avoid labelling new data that is unclear? Do I need to train it with an "Unclear" label or should I just leave this type of data unlabelled?

I can't seem to find the answer to this question in the spaCy docs.

Assuming you really want multilabel classification, i.e. an instance can have zero, one, or several classes, it is fine to have some training data without any label. If the model performs well, it should also predict no label for similar instances. Be careful, however: for the model, "no label" does not mean "unclear". It means that none of the possible classes applies, since each class is scored independently of the others.
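As a rough illustration of the point above, here is a minimal sketch of what such training data could look like. spaCy's text categorizer reads annotations from a `cats` dictionary mapping each label to a score, so an instance that belongs to no label simply gets 0.0 for every label. The label names, the example texts, the `make_cats` and `predicted_labels` helpers, and the 0.5 cut-off are all assumptions made for this example, not part of spaCy.

```python
# Hypothetical labels, for illustration only.
LABELS = ["BILLING", "SHIPPING"]


def make_cats(positive_labels):
    """Build a full label -> score dict; labels not listed get an explicit 0.0."""
    unknown = set(positive_labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return {label: 1.0 if label in positive_labels else 0.0 for label in LABELS}


# (text, annotations) pairs in the dict format spaCy accepts for "cats".
TRAIN_DATA = [
    ("I was charged twice this month", {"cats": make_cats(["BILLING"])}),
    ("Charged twice and the parcel is late", {"cats": make_cats(["BILLING", "SHIPPING"])}),
    # Not enough information: every label is 0.0, and no "Unclear" label exists.
    ("Hello, is anyone there?", {"cats": make_cats([])}),
]


def predicted_labels(scores, threshold=0.5):
    """Keep each label whose independent score reaches the threshold."""
    return [label for label, score in scores.items() if score >= threshold]
```

At prediction time, a dictionary of per-label scores (such as the one a trained pipeline stores in `doc.cats`) can be passed to `predicted_labels`; an empty list then means that no class applied. In spaCy v3 the multilabel variant is the `textcat_multilabel` component, while plain `textcat` treats the labels as mutually exclusive, so the component choice matters here.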

Note that in multiclass classification, i.e. when an instance always has exactly one class, it is impossible to assign no label to an instance. It would also be suboptimal to create an 'unclear' class, because in multiclass classification the model predicts the most likely class relative to the others. Semantically, 'no label' is not a regular label comparable to the others.
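The difference between the two setups can be sketched with plain Python. This assumes the usual formulation in which a multilabel model applies a sigmoid to each class score independently, while a multiclass model applies a softmax across all classes; the raw scores below are made up for the example.

```python
import math


def sigmoid_scores(logits):
    """Multilabel view: each class is scored on its own, between 0 and 1."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax_scores(logits):
    """Multiclass view: scores compete with each other and sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw scores for three classes; all of them are weak.
logits = [-3.0, -2.5, -4.0]

independent = sigmoid_scores(logits)
relative = softmax_scores(logits)

# Multilabel: no class reaches 0.5, so "no label" is a valid outcome.
multilabel_prediction = [i for i, p in enumerate(independent) if p >= 0.5]

# Multiclass: the highest relative score always wins, however weak it is.
multiclass_prediction = max(range(len(relative)), key=lambda i: relative[i])
```

With these scores the multilabel prediction is empty, while the multiclass prediction still picks a class, because softmax only ranks the classes against each other.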

Technically this is not a programming question (for future reference, it is better to ask such questions on https://datascience.stackexchange.com/ or https://stats.stackexchange.com/ ).
