I am trying to replicate this project in Python: Binary Classification: Twitter sentiment analysis
The steps are:
Step 1: Get data
Step 2: Text preprocessing using R
Step 3: Feature engineering
Step 4: Split the data into train and test
Step 5: Train prediction model
Step 6: Evaluate model performance
Step 7: Publish prediction web service
I am on Step 4 now, but I don't think I can continue.
import pandas
import re
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn import cross_validation
#read the dataset of tweets
header_row=['sentiment','tweetid','date','query', 'user', 'text']
train = pandas.read_csv("training.1600000.processed.noemoticon.csv",names=header_row)
#keep only the right columns
train = train[["sentiment","text"]]
#remove punctuation, special characters, numbers and lower case the text
def remove_spch(text):
    return re.sub("[^a-z]", ' ', text.lower())
train['text'] = train['text'].apply(remove_spch)
#Feature Hashing
def tokens(doc):
    """Extract tokens from doc.
    This uses a simple regex to break strings into tokens.
    """
    return (tok.lower() for tok in re.findall(r"\w+", doc))
n_features = 2**18
hasher = FeatureHasher(n_features=n_features, input_type="string", non_negative=True)
X = hasher.transform(tokens(d) for d in train['text'])
#Feature selection: choose the best 20,000 features using Chi-Square
X_new = SelectKBest(chi2, k=20000).fit_transform(X, train['sentiment'])
#Using Stratified KFold, split my data to train and test
skf = cross_validation.StratifiedKFold(X_new, n_folds=2)
I am sure that the last line is wrong, since X_new contains only the 20,000 features and not the sentiment column from the pandas DataFrame. How can I "join" the sparse matrix X_new with the DataFrame train, so that I can include it in the cross-validation and then use it with a classifier?
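You should pass your class labels (train['sentiment']), not the feature matrix, to StratifiedKFold, and then use skf as an iterator. At each iteration it yields the indices for the train set and the test set, and you can use them to split both X_new and the labels. There is no need to join the sparse matrix with the DataFrame.
Look at the code example in the official scikit-learn documentation: StratifiedKFold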
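As a rough illustration of the above, here is a minimal sketch. It assumes a recent scikit-learn, where the sklearn.cross_validation module used in the question has been replaced by sklearn.model_selection and the labels are passed to split() rather than to the constructor. The toy X_new and y arrays are stand-ins for the real sparse matrix and for train['sentiment'].values, and MultinomialNB is only an example classifier, not something the original project prescribes.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins (assumptions): a small non-negative sparse matrix in place of
# X_new, and two class labels in place of train['sentiment'].values.
rng = np.random.RandomState(0)
X_new = csr_matrix(rng.randint(0, 3, size=(20, 5)))
y = np.array([0] * 10 + [4] * 10)

skf = StratifiedKFold(n_splits=2)
scores = []

# split() takes both the features and the labels; it yields positional
# indices, so the same indices select rows of X_new and entries of y.
for train_idx, test_idx in skf.split(X_new, y):
    X_train, X_test = X_new[train_idx], X_new[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    clf = MultinomialNB()  # example classifier only
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

print(scores)
```

With the real data, converting the labels to a NumPy array first (train['sentiment'].values) keeps the indexing purely positional. On an old scikit-learn that still has sklearn.cross_validation, the labels go to the constructor instead, i.e. StratifiedKFold(train['sentiment'], n_folds=2), and skf itself is iterated.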