
One-Hot Encoding for representing corpus sentences in Python

I am a beginner with Python and the scikit-learn library. I am currently working on an NLP project that first needs to represent a large corpus with one-hot encoding. I have read scikit-learn's documentation on preprocessing.OneHotEncoder, but it does not seem to match my understanding of the term.

Basically, the idea is as follows:

  • 1000000 Sunday
  • 0100000 Monday
  • 0010000 Tuesday
  • ...
  • 0000001 Saturday

If the corpus has only 7 different words, then I only need a 7-digit vector to represent each word. A complete sentence can then be represented by stacking all the word vectors, which gives a sentence matrix. However, when I tried this in Python, it did not seem to work...
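To illustrate, here is a minimal NumPy sketch of the representation I have in mind, using a hypothetical 7-word vocabulary of weekday names:

import numpy as np

# hypothetical 7-word vocabulary, as in the weekday example above
vocab = ["Sunday", "Monday", "Tuesday", "Wednesday",
         "Thursday", "Friday", "Saturday"]
word_to_id = {word: idx for idx, word in enumerate(vocab)}

def one_hot(word):
    # a 7-digit vector with a single 1 at the word's index
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1
    return vec

# a sentence becomes a matrix with one one-hot row per word
sentence = ["Monday", "Tuesday"]
sentence_matrix = np.vstack([one_hot(word) for word in sentence])
print(sentence_matrix)
# [[0. 1. 0. 0. 0. 0. 0.]
#  [0. 0. 1. 0. 0. 0. 0.]]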

How can I work this out? My corpus has a very large number of distinct words.

Also, since the vectors are mostly filled with zeros, it seems we could use scipy.sparse to keep the storage small, for example the CSR format.
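For illustration, a mostly-zero matrix like the one above can be stored in CSR form with scipy.sparse.csr_matrix, which keeps only the non-zero entries and their positions:

import numpy as np
from scipy.sparse import csr_matrix

# the dense sentence matrix from the sketch above is mostly zeros
dense = np.array([[0, 1, 0, 0, 0, 0, 0],
                  [0, 0, 1, 0, 0, 0, 0]])

sparse = csr_matrix(dense)
print(sparse.nnz)        # 2 stored values instead of 14
print(sparse.toarray())  # convert back to dense when needed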

Hence, my entire question is:

How can the sentences in a corpus be represented with OneHotEncoder and stored in a sparse matrix?

Thank you guys.

In order to use the OneHotEncoder, you can split your documents into tokens and then map every token to an id (that is always the same for the same string). Then apply the OneHotEncoder to that list. The result is by default a sparse matrix.

Example code for two simple documents, "A B" and "B B":

from sklearn.preprocessing import OneHotEncoder
import itertools

# two example documents
docs = ["A B", "B B"]

# split the documents into tokens
tokens_docs = [doc.split(" ") for doc in docs]

# flatten the token lists and build a dictionary that maps
# each word to an id, here {'A': 0, 'B': 1}
# (sorted so the mapping is deterministic across runs)
all_tokens = itertools.chain.from_iterable(tokens_docs)
word_to_id = {token: idx for idx, token in enumerate(sorted(set(all_tokens)))}

# convert the token lists to token-id lists, here [[0, 1], [1, 1]]
token_ids = [[word_to_id[token] for token in tokens_doc] for tokens_doc in tokens_docs]

# convert the list of token-id lists to a one-hot representation;
# the result is a sparse matrix by default
vec = OneHotEncoder(n_values=len(word_to_id))
X = vec.fit_transform(token_ids)

print(X.toarray())

This prints the one-hot vectors, concatenated per document:

[[ 1.  0.  0.  1.]
 [ 0.  1.  0.  1.]]
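Note that the n_values parameter has been deprecated and removed in newer scikit-learn releases. On those versions, a roughly equivalent sketch passes explicit categories instead; this assumes, as in the example above, that every document has the same number of tokens:

from sklearn.preprocessing import OneHotEncoder

# same token-id lists as above: "A B" -> [0, 1], "B B" -> [1, 1]
token_ids = [[0, 1], [1, 1]]
n_words = 2  # vocabulary size, len(word_to_id) above

# one category list per token position, each covering the full vocabulary
vec = OneHotEncoder(categories=[list(range(n_words))] * 2)
X = vec.fit_transform(token_ids)

print(X.toarray())
# [[1. 0. 0. 1.]
#  [0. 1. 0. 1.]]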
