Sparse Matrix and Dataframe in Python Pandas

I am trying to replicate the project "Binary Classification: Twitter sentiment analysis" in Python.

The steps are:

Step 1: Get data
Step 2: Text preprocessing using R
Step 3: Feature engineering
Step 4: Split the data into train and test
Step 5: Train prediction model
Step 6: Evaluate model performance
Step 7: Publish prediction web service

I am now on Step 4, but I don't think I can continue.

import pandas
import re
from sklearn.feature_extraction import FeatureHasher

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

from sklearn import cross_validation

#read the dataset of tweets

header_row=['sentiment','tweetid','date','query', 'user', 'text']
train = pandas.read_csv("training.1600000.processed.noemoticon.csv",names=header_row)

#keep only the right columns

train = train[["sentiment","text"]]

#remove punctuation, special characters and numbers, and lowercase the text

def remove_spch(text):

    return re.sub("[^a-z]", ' ', text.lower())

train['text'] = train['text'].apply(remove_spch)


#Feature Hashing

def tokens(doc):
    """Extract tokens from doc.

    This uses a simple regex to break strings into tokens.
    """
    return (tok.lower() for tok in re.findall(r"\w+", doc))

n_features = 2**18
hasher = FeatureHasher(n_features=n_features, input_type="string", non_negative=True)
X = hasher.transform(tokens(d) for d in train['text'])

#Feature selection: choose the best 20,000 features using chi-square

X_new = SelectKBest(chi2, k=20000).fit_transform(X, train['sentiment'])

#Use stratified k-fold to split the data into train and test sets

skf = cross_validation.StratifiedKFold(X_new, n_folds=2)

I am sure the last line is wrong, since X_new contains only the 20,000 selected features and not the sentiment column from the pandas DataFrame. How can I "join" the sparse matrix X_new with the DataFrame train, so that I can include it in the cross-validation and then pass it to a classifier?

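You should pass your class labels (train['sentiment'] here), not the feature matrix, to StratifiedKFold, and then use skf as an iterator. At each iteration it yields the indices of the train set and the test set, which you can use to split both the sparse matrix and the labels. There is no need to join them into one object.

See the code example in the official scikit-learn documentation for StratifiedKFold.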
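
The idea can be sketched as follows. This is only an illustration under several assumptions: it uses the newer sklearn.model_selection API (where the labels go to split() rather than to the constructor) instead of the cross_validation module from the question, it replaces X_new and train['sentiment'] with small toy data, and LogisticRegression is an arbitrary choice of classifier that the question does not name.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Toy stand-ins (assumptions): X_new plays the role of the sparse matrix
# returned by SelectKBest, y the role of train['sentiment'].values.
rng = np.random.RandomState(0)
X_new = csr_matrix(rng.randint(0, 3, size=(40, 10)).astype(float))
y = np.array([0, 4] * 20)  # two hypothetical class labels, 20 samples each

# The splitter only produces row indices; features and labels stay separate.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in skf.split(X_new, y):
    # A CSR matrix and a NumPy array can both be indexed by an index array.
    X_train, X_test = X_new[train_idx], X_new[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    clf = LogisticRegression()  # hypothetical classifier choice
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))  # accuracy on the held-out fold

print(scores)
```

With the real data, passing train['sentiment'].values instead of the Series keeps the indexing purely positional, and calling X_new.tocsr() first is a safe way to make sure row indexing is supported.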
