
Gensim Word2Vec Vocabulary: Unclear output

I'm starting to get familiar with Word2Vec, but I'm struggling with a problem and couldn't find anything similar... I want to use gensim's Word2Vec on an imported PDF document (a book). To import it I used PyPDF2 and stored the whole book in a list. Furthermore, I used gensim's simple_preprocess to preprocess the data. This worked so far; I got the following output:

text=['schottky','diode','semiconductors',...]
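
The preprocessing step looked roughly like this (the exact call may have differed slightly; content_text is the list built in the appendix below):

from gensim.utils import simple_preprocess

# simple_preprocess lowercases, tokenizes and strips punctuation,
# returning one flat list of string tokens for the whole extracted text
text = simple_preprocess(content_text[0])
print(text[:3])  # ['schottky', 'diode', 'semiconductors']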

So then I tried to use the Word2Vec:

from gensim.models import Word2Vec
model=Word2Vec(text, size=100, window=5, min_count=5, workers=4)
words=list(model.wv.vocab)

but the output was like this:

print(words)
['c','h','t','k','d',...]

I expected the same words as in the text list, not just single characters. When I tried to find relations between words (e.g. 'schottky' and 'diode'), I got an error message saying that none of these words is included in the vocabulary.

My first thought was that the import was wrong, but I got the same result with textract instead of PyPDF2.

Does anyone know what the problem is? Thanks for your help!

Appendix:

Importing the book

import os
import PyPDF2

content_text = []
number_of_inputs = len(os.listdir(path))

file_to_open = path
open_file = open(file_to_open, 'rb')
read_pdf = PyPDF2.PdfFileReader(open_file)
number_of_pages = read_pdf.getNumPages()
page_content = ""
# Concatenate the extracted text of every page into one string,
# then store that single string in content_text
for page_number in range(number_of_pages):
    page = read_pdf.getPage(page_number)
    page_content += page.extractText()
content_text.append(page_content)

Instead of
text=['schottky','diode','semiconductors']

Use this
text=[['schottky','diode','semiconductors']]

More info: Gensim word2vec

Word2Vec requires as its sentences parameter a training corpus that is:

  • an iterable sequence (such as a list)
  • where each item is itself a list of string-tokens

If you supply just a list-of-strings, each string is seen as a list-of-one-character-strings, resulting in all the one-letter words you're seeing.
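
A quick way to see this:

# Iterating over a plain Python string yields its individual characters -
# which is exactly what Word2Vec does with each "sentence" it receives.
print(list('schottky'))  # ['s', 'c', 'h', 'o', 't', 't', 'k', 'y']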

So, use a list-of-lists-of-words, more like:

[
 ['schottky','diode','semiconductors'],
]

(Note also that you generally won't get interesting Word2Vec results on tiny toy-sized data sets of just a few texts and just dozens to hundreds of words. You need many thousands of unique words, across many dozens of contrasting examples of each word, to induce the useful word-vector arrangements that Word2Vec is known for.)
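
Putting it together with the appendix code, a minimal sketch of the corrected pipeline might look like this (using the gensim 3.x names from the question, i.e. size and model.wv.vocab; in gensim 4+ these become vector_size and model.wv.key_to_index):

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Tokenize each extracted string separately so Word2Vec receives a
# list of token-lists rather than one flat list of words.
sentences = [simple_preprocess(page_text) for page_text in content_text]

model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

print(list(model.wv.vocab)[:10])                 # full words now, not single characters
print(model.wv.similarity('schottky', 'diode'))  # works once both words meet min_count

This also works better if content_text holds one string per page (or per paragraph) rather than the whole book as a single string, since Word2Vec then gets many shorter training sentences instead of one enormous one.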
