
How to train a machine learning model with FastText output

Is there any method in FastText by which I can get the following format (<1x10000 sparse matrix of type '<class 'numpy.float64'>' with 67 stored elements in Compressed Sparse Row format>) from the FastText output below, or any other way to train my ML model on it? When I used TF-IDF I got a sparse matrix and trained the ML model with it, but now I want to train the model using FastText instead.

fasttext_out=model_ted.wv.most_similar("The Lemon Drop Kid , a New York City swindler, is illegally touting horses at a Florida racetrack. After several successful hustles, the Kid comes across a beautiful, but gullible, woman intending to bet a lot of money. The Kid convinces her to switch her bet, employing a prefabricated con. Unfortunately for the Kid, the woman belongs to notorious gangster Moose Moran , as does the money. The Kid's choice finishes dead last and a furious Moran demands the Kid provide him with $10,000  by Christmas Eve, or the Kid won't make it to New Year's. The Kid decides to return to New York to try to come up with the money. He first tries his on-again, off-again girlfriend Brainy Baxter . However, when talk of long-term commitment arises, the Kid quickly makes an escape.")

model_ted.wv.most_similar("school")

Output:

[('Psycho-biddy', 0.9323669672012329),
 ('Slasher', 0.8850599527359009),
 ('Demonic child', 0.8805997967720032),
 ('Giallo', 0.8504119515419006),
 ('Road-Horror', 0.821454644203186),
 ('Anthology', 0.8191317915916443),
 ('Czechoslovak New Wave', 0.8187490105628967),
 ('Supernatural', 0.813347339630127),
 ('Psychological thriller', 0.8018383979797363),
 ('Kitchen sink realism', 0.8017964959144592)]

My main intention is to turn this output into vectors and train a machine learning model on them. Please confirm whether that is possible.

My answer to your previous similar question still applies, specifically:

FastText inherently only gives you word-vectors: a vector per word. If you want a vector for a longer run of text made up of many words, you'll need to decide how you want to turn that bunch of individual word-vectors into something else.

One OK decision to simply try is averaging all those word-vectors together. (There are many other ways to represent larger texts as vectors, or bags-of-other-values.)
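As a minimal sketch of that averaging, assuming model_ted is the trained gensim FastText model from your code and using plain whitespace splitting as a stand-in for whatever tokenization you actually use:

import numpy as np

def doc_vector(text, wv):
    # Plain whitespace split; substitute your real tokenizer here.
    tokens = text.split()
    if not tokens:
        # Empty text: fall back to a zero vector of the right size.
        return np.zeros(wv.vector_size, dtype=np.float32)
    # FastText composes vectors from character n-grams, so wv[token]
    # returns something even for words not seen during training.
    return np.mean([wv[token] for token in tokens], axis=0)

vec = doc_vector("The Lemon Drop Kid, a New York City swindler, is illegally touting horses at a Florida racetrack.", model_ted.wv)
print(vec.shape)  # (vector_size,) dense floats, not a sparse matrix

Note that these are dense vectors of length vector_size (e.g. 100), not the very wide sparse CSR matrices TF-IDF gives you; scikit-learn estimators accept either form.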

You could then try passing those averages as the features to a downstream classifier.
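For that downstream step, here is a sketch under the assumption that you have parallel lists texts and labels for your classification task (hypothetical names, not from your code) and scikit-learn available:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# texts: list of raw strings, labels: list of class labels (assumed to exist).
X = np.vstack([doc_vector(t, model_ted.wv) for t in texts])  # doc_vector from the sketch above
y = np.array(labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

Any other scikit-learn classifier (SVC, RandomForestClassifier, etc.) can be dropped in the same way, just as you would have done with the TF-IDF matrix.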

Separately, as also pointed out in that prior answer, you won't get a meaningful set of .most_similar() results if you pass a long string, as your example code does: .most_similar() expects individual words (or lists of words/vectors), not a whole passage of raw text.
