
Sci-Kit Learn: Investigating Incorrectly Classified Data

I want to analyze data that has been incorrectly classified by a model using scikit-learn, so that I can improve my feature generation. I have a method for doing this, but since I am new to both scikit-learn and pandas, I'd like to know if there is a more efficient/direct way to accomplish it. It seems like something that would be part of a standard workflow, but in the research I did, I didn't find anything directly addressing this backwards mapping from model classification, through the feature matrix, to the original data.

Here's the context/workflow I'm using, as well as the solution I've devised. Below that is sample code.

Context. My workflow looks like this:

  1. Start with a bunch of JSON blobs, the raw data. This is a pandas DataFrame.
  2. Extract the relevant pieces for modeling; call this the data. This is a pandas DataFrame.
  3. In addition, we have truth data for all the data, so we'll call that truth or y.
  4. Create a feature matrix in scikit-learn; call this X. This is a large sparse matrix.
  5. Create a random forest object, call this forest.
  6. Create random subsets of the feature matrix for training and test using scikit-learn's train_test_split() function.
  7. Train the forest on the training data above, X_train, which is a large sparse matrix.
  8. Get the indices of the false positive and false negative results. These are indices into X_test, a sparse matrix.
  9. Go from a false positive index into X_test back to the original data.
  10. Go from the data to the raw data, if necessary.

Solution.

  • Pass an index array into the train_test_split() function, which applies the same shuffle to the index array and returns it as indices for the train and test data (idx_train, idx_test)
  • Gather the indices of the false positives and false negatives; these are ndarrays
  • Use these to look up the original location in the index array, e.g., index = idx_test[false_example] for each false_example in the false_neg array
  • Use that index to look up the original data: data.iloc[index] is the original data
  • Then data.index[index] will return the index value into the raw data, if needed

Here's code associated with an example using tweets. Again, this works, but is there a more direct/smarter way to do it?

# imports needed by this example
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# take a sample of our original data
data = tweet_df[0:100]['texts']
y = tweet_df[0:100]['truth']

# create the feature vectors
vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
X = vec.fit_transform(data)  # this is now the feature matrix

# split the feature matrix into train/test subsets, keeping the indices
# back into the original X using the array indices
indices = np.arange(X.shape[0])
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, indices, test_size=0.2, random_state=state)  # state: a fixed seed defined earlier

# fit and test a model
forest = RandomForestClassifier()
forest.fit(X_train, y_train)
predictions = forest.predict(X_test)

# get the indices for false negatives and false positives in the test set
false_neg, false_pos = tweet_fns.check_predictions(predictions, y_test)

# map the false negative indices in the test set (which index features) back
# to the original data (text)
print("False negatives:\n")
pd.options.display.max_colwidth = 140
for i in false_neg:
    original_index = idx_test[i]
    print(data.iloc[original_index])
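
As an aside, since idx_test and false_neg are both integer arrays, the same lookup can be done in a single vectorized step rather than a loop. A minimal sketch, reusing the variables above:

# .iloc accepts an array of positions, so the false-negative texts can be
# pulled out with one fancy-indexing operation
false_neg_texts = data.iloc[idx_test[false_neg]]
print(false_neg_texts)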

And the check_predictions function:

import numpy as np

def check_predictions(predictions, truth):
    # take a 1-dim array of predictions from a model and a 1-dim truth vector,
    # calculate their similarity, and return the indices of the false
    # negatives and false positives in the predictions
    truth = truth.astype(bool)
    predictions = predictions.astype(bool)
    matches = sum(predictions == truth)
    print(matches, 'of', len(truth), 'or', float(matches) / float(len(truth)), 'match')

    # false positives
    print("false positives:", sum(predictions & ~truth))
    # false negatives
    print("false negatives:", sum(~predictions & truth))
    false_neg = np.nonzero(~predictions & truth)  # these are tuples of arrays
    false_pos = np.nonzero(predictions & ~truth)
    return false_neg[0], false_pos[0]  # we just want the arrays
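
For what it's worth, scikit-learn's confusion_matrix gives the same counts that this function prints, which makes a handy cross-check. A minimal sketch, assuming binary 0/1 labels:

from sklearn.metrics import confusion_matrix

# for binary labels, ravel() unpacks the 2x2 matrix in the order
# tn, fp, fn, tp (rows are truth, columns are predictions)
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print("false positives:", fp, "false negatives:", fn)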

Your workflow is:

raw data -> features -> split -> train -> predict -> error analysis on the labels

There is a row-for-row correspondence between the predictions and the test feature matrix, so if you want to do error analysis on the features, there is no problem. If you want to see which raw data is associated with errors, then you have to either do the split on the raw data, or else track which data rows mapped to which test rows (your current approach).

The first option looks like:

fit transformer on raw data -> split raw data -> transform train/test separately -> train/test -> ...

That is, it uses fit before splitting and transform after splitting, leaving you with raw data partitioned in the same way as the labels.
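
A minimal sketch of that first option, reusing the names from the question (the split happens on the raw data, so data_test lines up row-for-row with the predictions):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# fit the transformer on the raw data...
vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
vec.fit(data)

# ...split the raw data and labels together...
data_train, data_test, y_train, y_test = train_test_split(
    data, y, test_size=0.2, random_state=state)

# ...then transform the train and test partitions separately
X_train = vec.transform(data_train)
X_test = vec.transform(data_test)

forest = RandomForestClassifier()
forest.fit(X_train, y_train)
predictions = forest.predict(X_test)

# errors now map directly back to the raw text, no index bookkeeping needed
false_neg, false_pos = check_predictions(predictions, y_test)
print(data_test.iloc[false_neg])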
