
Text classification using e1071 (SVM)

I have a data frame with two columns. One column contains text: each row holds a snippet that belongs to one of three classes (skill, qualification, experience). The other column holds the corresponding class labels.

Snapshot of the dataframe:


How do I apply an SVM from the e1071 package to this data? How do I convert the text column into numeric scores? I thought of converting the text column into a document-term matrix. Is there any other way? And how do I build a document-term matrix?

You can use the RTextTools package to create a document-term matrix. Use the create_matrix function:

# Create the document term matrix. If column name is v1
dtMatrix <- create_matrix(data["v1"])

Then you can train your SVM model using this:

# Configure the training data
container <- create_container(dtMatrix, data$label, trainSize=1:102, virgin=FALSE)
 
# train a SVM Model
model <- train_model(container, "SVM", kernel="linear", cost=1)

For information, RTextTools uses the e1071 package internally to train the SVM model.

For more details, refer to the RTextTools and e1071 documentation.

You could use the tm package in R. Before building the document-term matrix you have to preprocess the text: remove stop words, punctuation, and numbers; normalize variant spellings of the same term (so that differently written forms of "USA" map to one token); apply stemming; and so on. Then add a weighting such as tf-idf to the DTM to give more importance to significant terms.

Once you are done with these steps, you can use svm() from e1071 to train the classifier:
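The preprocessing steps listed above could look roughly like this with tm. This is only a sketch: the three example strings are made up, the order of the cleaning steps is one reasonable choice among several, and stemDocument() assumes the SnowballC package is installed.

```r
library(tm)  # stemDocument() additionally needs the SnowballC package

# Hypothetical toy data standing in for the text column
text <- c("5 years of experience in Java development",
          "Bachelor's degree in Computer Science",
          "Strong communication and SQL skills")

corpus <- VCorpus(VectorSource(text))
corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case everything
corpus <- tm_map(corpus, removePunctuation)                  # strip punctuation
corpus <- tm_map(corpus, removeNumbers)                      # strip digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, stemDocument)                       # reduce words to stems
corpus <- tm_map(corpus, stripWhitespace)                    # collapse leftover spaces

# Document-term matrix with tf-idf weighting
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
inspect(dtm)
```

Each row of dtm is one document and each column one term, so the matrix can be passed on as the feature matrix for a classifier.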

 fit <- svm(x, y, kernel = "linear") 

Here,

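  x = the document-term matrix

  y = a factor with the corresponding class labels (if y is not a factor, svm() defaults to regression instead of classification)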

Use the model to predict the classes of your test data (make sure the test data is preprocessed in the same way as the training data).
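An end-to-end sketch of training and prediction might look like the following. The texts and labels are invented for illustration, make_corpus is a hypothetical helper, and plain term counts are used instead of tf-idf so that training and test weights stay directly comparable. The test matrix is restricted to the training vocabulary through the dictionary control option, because the model can only use terms it has seen during training.

```r
library(tm)
library(e1071)

# Hypothetical toy data; replace with your own text and label columns
train_text  <- c("java and sql skills", "python programming skills",
                 "bachelor degree in science", "master degree in engineering",
                 "five years experience as developer", "two years experience in testing")
train_label <- factor(c("skill", "skill", "qualification", "qualification",
                        "experience", "experience"))
test_text   <- c("strong python and java skills", "ten years experience")

# Hypothetical helper: the same cleaning steps for training and test text
make_corpus <- function(txt) {
  corpus <- VCorpus(VectorSource(txt))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  tm_map(corpus, removeWords, stopwords("english"))
}

train_dtm <- DocumentTermMatrix(make_corpus(train_text))
# Build the test matrix over the training vocabulary only
test_dtm  <- DocumentTermMatrix(make_corpus(test_text),
                                control = list(dictionary = Terms(train_dtm)))

x_train <- as.matrix(train_dtm)
# Make sure the test columns are in the same order as the training columns
x_test  <- as.matrix(test_dtm)[, colnames(x_train), drop = FALSE]

# scale = FALSE avoids warnings about constant columns in sparse count data
fit  <- svm(x_train, train_label, kernel = "linear", scale = FALSE)
pred <- predict(fit, x_test)
pred
```

With so few documents the predictions themselves are not meaningful; the point is only the shape of the workflow.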

I also considered using RTextTools. It is relatively easy to use. However, it is a poor fit if your data has a class imbalance, because it doesn't let you control a stratified split in your container.

container <- create_container(dtMatrix, data$label, trainSize=1:102, virgin=FALSE)

With the argument trainSize=1:102 you don't know what the proportions of the class labels in the training set will be, and nothing keeps them in line with the proportions in the full data. So I would avoid using it.
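If you train with svm() directly, you can draw a stratified split yourself in base R. The sketch below uses invented, deliberately imbalanced labels and an assumed 70/30 split; it samples the same fraction of rows within each class.

```r
set.seed(42)  # for a reproducible split

# Hypothetical imbalanced labels standing in for data$label
label <- factor(c(rep("skill", 60), rep("qualification", 30), rep("experience", 10)))

# Sample 70% of the row indices within each class
train_frac   <- 0.7
idx_by_class <- split(seq_along(label), label)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(train_frac * length(idx)))]
}), use.names = FALSE)
test_idx <- setdiff(seq_along(label), train_idx)

# Class proportions in the training and test parts
prop.table(table(label[train_idx]))
prop.table(table(label[test_idx]))
```

The resulting indices can then be used to subset the feature matrix and the labels, for example x[train_idx, ] and y[train_idx] for fitting and x[test_idx, ] for prediction.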
