
Which kinds of features are good to extract from text for author identification classification?

I want to classify texts by their authors for an author identification task.
The features might be:
the length of the author's text
or lexical features of the author's text. Can anybody suggest which kinds of features would help improve the classification results? The sample data frame I gathered looks like this:
[image: sample data frame]

Each text is 4 sentences long, and I have at least 18 authors. As for the classification, this task is my thesis, so I cannot "just" apply a classifier to the raw text; the goal is to apply classification to features extracted from the text. I want to know which kinds of features can help me improve classification accuracy (with either ML approaches or neural networks).

How long are your texts? You can try computing tf-idf vectors for each document and then running a k-nearest-neighbors (kNN) search over your dataset. A more sophisticated way is to featurize your texts with a neural network and then run kNN on those vectors. If your dataset is big enough, there are not too many authors, and there are several texts per author, you could try fine-tuning a neural network to classify your texts. But I would go for kNN over the neural network features.
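
To make the tf-idf plus kNN suggestion concrete, here is a minimal sketch in plain Python. The toy texts, the author labels `author_a` / `author_b`, and all function names are made up for illustration; they are not from the question's data. It assumes a simple word tokenizer, a smoothed idf, cosine similarity, and a majority vote among the `k` nearest training texts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased word tokens; punctuation is dropped in this simple sketch.
    return re.findall(r"[a-z']+", text.lower())


def fit_idf(docs_tokens):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}


def tfidf_vector(tokens, idf):
    # Sparse, L2-normalised tf-idf vector; terms unseen in training are ignored.
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(t, 0.0) for t, v in a.items())


def knn_predict(query, train_vectors, train_authors, idf, k=3):
    # Rank training texts by similarity to the query and take a majority vote.
    q = tfidf_vector(tokenize(query), idf)
    ranked = sorted(
        zip(train_vectors, train_authors),
        key=lambda pair: cosine(q, pair[0]),
        reverse=True,
    )
    votes = Counter(author for _, author in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus: three short texts per (made-up) author.
train_texts = [
    "the ship sailed across the stormy sea at dawn",
    "waves crashed against the ship on the open sea",
    "the sailor watched the sea from the deck of the ship",
    "the model was trained on a large dataset of images",
    "we evaluate the model on a held out dataset",
    "the dataset contains labels used to train the model",
]
train_authors = ["author_a"] * 3 + ["author_b"] * 3

idf = fit_idf([tokenize(t) for t in train_texts])
train_vectors = [tfidf_vector(tokenize(t), idf) for t in train_texts]

print(knn_predict("the old ship returned from the sea", train_vectors, train_authors, idf))
```

On real data you would probably not hand-roll this: scikit-learn's `TfidfVectorizer` and `KNeighborsClassifier` cover the same two steps. With texts only 4 sentences long, it may also be worth trying character n-grams instead of words, since they tend to pick up style rather than topic; that is a suggestion to test, not something guaranteed to help.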
