
How do I retain numbers while preprocessing data using gensim in Python?

Original 2021-05-09 13:21:30 4 1 nlp/ gensim/ preprocessor/ lda/ latent-semantic-analysis

I have used gensim.utils.simple_preprocess(str(sentence)) to create a dictionary of words that I want to use for topic modelling. However, this also filters out important numbers (house resolutions, bill numbers, etc.) that I really need. How can I overcome this? Possibly by replacing digits with their word form. How do I go about it, though?

1 answer

You don't have to use simple_preprocess() - it's not doing much, it's not that configurable or sophisticated, and typically the other Gensim algorithms just need lists-of-tokens.

So, choose your own tokenization - which in some cases, depending on your source data, could be as simple as a .split() on whitespace.
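As a rough illustration of rolling your own tokenization, here is a minimal sketch that keeps numbers. The function name tokenize_keep_numbers, the \w+ pattern, and the length limits are all assumptions chosen for this example, not anything Gensim prescribes.

```python
import re

# Hypothetical pattern: runs of letters, digits and underscores,
# so tokens such as "1234" or "hr1234" are kept.
TOKEN_RE = re.compile(r"\w+", re.UNICODE)


def tokenize_keep_numbers(text, min_len=1, max_len=15):
    """Lowercase the text, split on non-word characters, keep numeric tokens."""
    tokens = TOKEN_RE.findall(text.lower())
    # Length filter is optional; min_len=1 keeps single digits like "7".
    return [t for t in tokens if min_len <= len(t) <= max_len]


docs = [
    "House Resolution 1234 amends Bill No. 56.",
    "H.R. 7 was tabled in 2021",
]

# One list of tokens per document.
texts = [tokenize_keep_numbers(d) for d in docs]
```

Lists of tokens like texts are the shape that, for example, gensim.corpora.Dictionary expects. Whether a plain .split() or a regex like this is the better fit depends on how punctuation appears around the numbers in your data.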

If you want to look at what simple_preprocess() does, as a model, you can view its Python source at:

https://github.com/RaRe-Technologies/gensim/blob/351456b4f7d597e5a4522e71acedf785b2128ca1/gensim/utils.py#L288
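As far as I can tell from that source, the digits disappear because the tokenizer only matches word characters that are not digits. The sketch below uses the standard re module with a pattern I believe mirrors it (treat that as an assumption and check the linked file), and also tries the idea from the question of spelling digits as words first. The names ALPHA_ONLY, spell_digits and alpha_tokens are made up for this example.

```python
import re

# Assumption: mirrors the alphabetic-only pattern in the linked source,
# i.e. one or more word characters, none of which is a digit.
ALPHA_ONLY = re.compile(r"(((?![\d])\w)+)", re.UNICODE)

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spell_digits(text):
    """Replace every ASCII digit with its English word, padded by spaces."""
    return re.sub(r"[0-9]", lambda m: " " + DIGIT_WORDS[int(m.group())] + " ", text)


def alpha_tokens(text):
    """Tokenize with the alphabetic-only pattern after lowercasing."""
    return [m.group() for m in ALPHA_ONLY.finditer(text.lower())]


text = "Bill No 56 passed"
dropped = alpha_tokens(text)                 # the number is lost
kept = alpha_tokens(spell_digits(text))      # digits survive as words
```

Note that spelling digit by digit turns 56 into the two separate tokens five and six, so the identity of the bill number is lost in a bag-of-words model. If the exact number matters, keeping it as a single token with your own tokenizer is probably the safer route.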

How do I deal with preprocessing and with unseen data in a NLP problem?

Text data preprocessing in python

How to handle URL links in text data while preprocessing data in NLP

How do I find a synonym of a word or multi-word paraphrase using the gensim toolkit

How do I subtract and add vectors with gensim KeyedVectors?

How to find semantic similarity using gensim and word2vec in python

Topic Modeling Using Gensim in Python

Python Gensim: how to calculate document similarity using the LDA model?

How many data is needed to creat useful word vectors using gensim?

How to load sentences into Python gensim?


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1 1 ACCEPTED 2021-05-10 08:21:43
